Query Translation for Multilingual Content with Semantic Technique - UKM

Page created by Monica Goodman
 
CONTINUE READING
Sains Malaysiana 49(9)(2020): 2113-2118
http://dx.doi.org/10.17576/jsm-2020-4909-09

           Query Translation for Multilingual Content with Semantic Technique
                 (Terjemahan Pertanyaan untuk Kandungan Pelbagai Bahasa dengan Teknik Semantik)

                 N ORITA M D N ORWAWI*, S UNDRESAN A/L P ERUMAL, E MRAN H UDA & W AKA J ENG

                                                      ABSTRACT
Cross-lingual information retrieval (CLIR)
resources. Thus, translation is the key element in the query processing. There are three translation approaches: query,
document, or hybrid query-document. However, query translation is very challenging due to the polysemy problem.

misinterpreted. This paper presents a semantic technique on query translation for a multilingual knowledge repository

language including Jawi
prophetic food. These keywords were annotated with the relevant Quranic verses, Hadith texts, Manuscript text images

documents. A one-stop knowledge repository on prophetic food was developed as a proof of concept using sources are

and integrity.
Keywords: Cross lingual information retrieval; one stop knowledge repository; prophetic food; query translation;
semantic technique

                                                      ABSTRAK
Dapatan semula maklumat silang bahasa (CLIR) membolehkan pertanyaan pengguna diajukan dalam bahasa yang
berbeza daripada bahasa bahan sumber sasaran. Oleh itu, terjemahan menjadi kunci utama dalam pemprosesan
pertanyaan. Terdapat 3 jenis pendekatan terjemahan: terjemahan pertanyaan, dokumen atau pertanyaan-dokumen
hibrid. Walau bagaimanapun, terjemahan pertanyaan adalah mencabar berpunca daripada masalah polisemi. Gaya
linguistik pelbagai bahasa yang berbeza menimbulkan kesamaran makna yang menyebabkan hasrat sebenar pengguna
boleh disalah tafsir. Kajian ini membentangkan teknik semantik terjemahan pertanyaan repositori pelbagai bahasa
untuk menambahbaik pemprosesan pertanyaan. Dokumen sumber yang diterjemahkan secara manual atau corpora selari
dalam Bahasa Inggeris, Arab dan Melayu termasuk teks Jawi digunakan sebagai data kajian. Set kata kunci telah
dikenal pasti oleh pakar bidang berkaitan dengan makanan sunnah. Kata kunci ini dianotasikan dengan ayat-ayat

Perkataan sinonim dan terjemahan secara konteks dianotasikan juga kepada kata kunci berkaitan. Setiap pertanyaan
akan menggunakan 3 kaedah pemadanan ke atas senarai indeks kata kunci yang akan menghubungkan kepada
dokumen yang relevan. Repositori pengetahuan sehenti berkaitan makanan sunnah dibangunkan sebagai bukti konsep

bidang untuk menjamin kesahihan dan integriti.
Kata kunci: Dapatan semula maklumat silang bahasa; makanan sunnah; repositori pengetahuan sehenti; teknik
semantik; terjemahan pertanyaan

                    INTRODUCTION                             mining, and clustering and Web information retrieval.
                                                             However, today , due to the diversity of information and
With the advancement of computer, Internet and
                                                             language barriers, keeping up with communication and
communication technology, information access has
                                                             cultural interchange across the globe is very challenging.
advanced into many tools and applications such as the
                                                             As for the IR, there are 4 categories of retrieval which
information retrieval (IR) , question answering tasks,
                                                             are: mono-lingual, bi-lingual, cross-lingual (CL), and
summarization, multimedia information retrieval, text
2114

multi-lingual (ML). Cross-language and multi-lingual IR                                                               CLIR and
                                                                        MLIR becomes more important due to the rapid progress of
language to retrieve documents from another language.                   Internet technology that opens the boundary of language
The query is posed in the source language and the language              and culture. Hence, translation is the key signature
of the relevant document is the target language.                        task in the query processing due to the multilingual
      On the other hand, handling heterogenous                          nature. There are three main approaches in handling the
knowledge sources is challenging due to the diversity                   translation which are: document, query or hybrid query
of perspectives and context of the knowledge. The level                 document (Agbele et al. 2018) as shown in Table 1.
of complexity increases since these resources are in

                                    TABLE 1. Comparison of the three translation approach (Agbele et al. 2018)

         SN          Parameter                  Query Translation        Document Translation    Query-Document Translation

          1          Extra storage space        Not needed               Needed                  Not needed
                                                                                                 More than both query and
          2          Ambiguity                  maximum                  Minimum
                                                                                                 document
          3          Information Retrieval      Bilingual                Bilingual               Bilingual and Multilingual
                                                                                                 More than both query and
          4          Transition time            Less                     More than query
                                                                                                 document
          5          Flexibility                Highly                   Less                    Less

                  TABLE 2. Comparison of the three translation approach (Agbele et al. 2018; Jena & Rautaray, 2019)

       Translation Approach        Dictionary Based                  Corpora Based           Machine Translation Based
       Ambiguity                   High                              Low                     Low

                                   Possible                          Possible                Not Possible

       Working Architecture        Visible as like white box testing Visible as like white   Works similar to black box testing

                                                                     box testing

       Development                 Less expensive                    More expensive than     More expensive
       expenses                                                      DBT

       Translation Availabilty Highly available in many              Available only in few   Available only in few languages

                                   languages                         languages

      Based on Table 1, Agbele et al. (2018) highlighted                      Synonymy refers to multiple words with common
that the query-document translation approach does not                   meaning whereas polysemy refers to words with
require extra storage space but have issues related to                  multiple meanings. Synonymous and polysemous words
language ambiguity and speed due to synonymy and                        will reduce the recall and precision rates (Azad & Deepak
polysemy problem.
2115

ambiguity will lead to misinterpretation of user’s true
intention especially for translation. Hence, processing          date and goat’s milk were selected as the scope of the
a query and its translation for CLIR and MLIR with               study. Begin with content analysis and systematic
heterogeneous knowledge resources ensuring right                 literature review conducted by domain experts in Quran
context and diverse perspectives is a challenge (Abusalah        and Sunnah studies, Manuscript on Malay Remedies,
et al. 2005; Aldhlan et al. 2010; Elayeb & Bournas               Food Science and Health Sciences. All relevant
2016; Sharmal & Morwal 2015).
      As for the query translation, there are three sources
of knowledge: dictionary, corpora, or machine learning.          taxonomy on dates and prophetic food was build based
Table 2 illustrates the comparison of these techniques           on the literature, concept, and relations to the Quranic,
(Agbele et al. 2018; Jena & Rautaray 2019).
      In this study, corpora-based query translation was
used which has low ambiguity than dictionary based,              a structural connection was designed for the specialized
and moderate development cost compared to machine                search engine on prophetic food which later implemented
learning based translation as presented in Table 2.              as a web and mobile application called NAISSE.
      This paper presents a semantic technique for query
translation to handle the cross-lingual ambiguity of the                         RESULTS AND DISCUSSION
diverse resources where query and targeted text are
possibly in Arabic, Malay language, or English. Users            Based on the knowledge collected, a taxonomy of
may pose query in English, Arabic or Malay to search             knowledge was constructed on dates and goat’s milk
for relevant text in Al-Quran, Hadith, classical Malay           linking to the resources that map these knowledge
manuscript in Roman or Jawi
related to prophetic food. Search results will embed
the direct match with the semantic technique of using            tagged with the translation in the other two languages,
translation, synonym and context based on meaning as             its synonym or contextual meaning. For example, kurma
determined by expert.                                            in Bahasa Malaysia could mean ‫ﻡﺕ ﺭ‬, ‫ﺏﻁﺭ‬, ‫ ﻝﺥﻥ‬in
                                                                 Arabic. The Quranic, Hadith, and classical manuscript
                MATERIALS AND METHODS
A knowledge repository was prepared consists of                  and its details including abstracts. Figure 1 illustrates the
Quranic verses, Hadith text, classical manuscript text,          representation of the knowledge repository.

                                    Keywords                                                      Multilingual
                                       List                   Direct                              Resources
        Query
                                                              Match

                                                          Translate

                                                              Context

                                                               Synonym &                              Result
                         Arabic     English    Malay            Polysemy

                FIGURE FIGURE
                       1. Schematic   diagram
                              1. Schematic diagramof
                                                   of the  multilingual
                                                      the multilingual     knowledge
                                                                       knowledge repositoryrepository
2116

       In order to resolve the linguistic ambiguity issue due                        and resources. Each keyword will record the list of relevant
      In order to resolve the linguistic ambiguity issue due                         by expert relevant to the scope and resources. Each
 to query translation, a semantic technique was proposed as                                            ID of the Quranic verses, Hadith text, manuscript
to query translation, a semantic technique was proposed                                                                                                          ID
 a mechanism that will accept the query in either of the three
as  a mechanism
 languages         that will
            then matched      accept
                           either    the query
                                  exactly, based in
                                                 on either of the
                                                    translation or                   of the ofQuranic
                                                                                     index             verses,
                                                                                               prophetic       Hadith
                                                                                                         food names    text,with
                                                                                                                    tagged   manuscript     snippet
                                                                                                                                 its translation and
three  languages    then  matched    either  exactly,  based   on
 context match to the relevant documents. Keywords list contains
translation or context match to the relevant documents.                              index Referring
                                                                                           of prophetic foodtonames
                                                                                                      also          tagged
                                                                                                                Figure     withquery
                                                                                                                       1, the   its translation
                                                                                                                                       will be
Keywords list contains possible keyword identified

                           Word1                                         

                              Quran ID1        Quran ID2        Quran ID3          …………..            Quran IDn

                                                Hadith ID1       Hadith ID2       Hadith ID3         …………..           Hadith IDn

                                                       Manuscript D1          Manuscript ID2        Manuscript ID3         …       Manuscript IDn
                                                                                                                           …

                                                           Ar cle ID1         Ar cle ID2         Ar cle ID3        …………..          Ar cle IDn

                           Word1          Word1, English     Word1, Arabic     Word1, Malay       Scien fic          Synonym          Context
                                                                                                  Names

                        FIGURE 2. Illustration
                                       FIGURE of  keywords oflist
                                               2. Illustration    structure
                                                               keywords  listlinked  tolinked
                                                                              structure prophetic
                                                                                              to food list
                                                                         prophetic food list

       Referring also to Figure 1, the query will be                                  Dictionary Based Translation Match T
                                                                                      Dictionary Based Translation Match T
                                                                                          As for the dictionary-based translation T,
 match (D),
         (D), dictionary-based
               dictionary-based match   match for
                                                for the
                                                     the translation
                                                          translation (T)
                                                                       (T)               As for the dictionary-based translation T,
match
 and   context-based       match      (C) through
and context-based match (C) through its synonyms (S)  its synonyms     (S)                  T = {w| w∈Q} ∪ {w| w∈H} ∪ {w| w∈J} for query
 and   semantic    (M                                                                      Tmm= {w| w Q} {w| w H}     {w| w J} for query
and semantic (M
      The semantic
 O. The     semantic technique
                          technique isis represented
                                             represented formally
                                                             formally asas           in
O.                                                                                    in Malay
                                                                                         Malay language,
                                                                                                language, m
                                                                                                          m
 follows:
follows:
 Assume L
Assume      L is
              is the
                  the set
                        setofofallallmatching
                                       matchingwords,
                                                  words,     from
                                                          w, w,    all the
                                                                from   all                      = {w| w∈S } ∪{w|
                                                                                                              {w|ww∈S
                                                                                           TT a= {w| w S }r        Sj}j} ∪ {w|
                                                                                                                            {w|ww∈J}  forquery
                                                                                                                                  J} for  query
 resources    Quran     (Q),   Hadith     (H),  Malay
the resources Quran (Q), Hadith (H), Malay manuscript     manuscript    in                  a           r

 roman    text (S
in roman text (S    ), Malay      manuscript     in  Jawi   text
                   r r), Malay manuscript in Jawi text (Sjj) and
                                                                 (S ) and            in
                                                                                      inArabic
                                                                                        Arabic language,
                                                                                                language, aa
                              Thus,
                         J).Thus,
                         J).
 L  = {w|  w   D}        {w|
L = {w| w∈D} ∪ {w| w∈T}∪{w|   w T} {w| w∈C}             where LL are
                                               w C} where         are the
                                                                       the
                                                                                              TTe e=={w|
                                                                                                      {w|ww∈Q}   {w|ww∈H}
                                                                                                            Q} ∪{w|    H} ∪{w|   Srr}} ∪ {w|
                                                                                                                               w∈S
                                                                                                                           {w| w
 list of matched     keywords
list of matched keywords such that   such  that
                                                                                              w{w|Sjw∈S
                                                                                                    } forj}query
                                                                                                            for query in English
                                                                                                                 in English      language,
                                                                                                                            language, e ande and
 Direct Match
 Direct  Match DD
     DDa == {w|
         a
            {w| w∈Q}∪{w|
                w Q} {w| w∈H}  for Arabic
                         w H} for  Arabictext
                                          text                                                T=   {w|l |wwl ∈TTa}}∪{
                                                                                               T= {w
                                                                                                     l     l   a
                                                                                                                    { wwl l| |wwl l ∈TTmm}} &&ll≠≠a,a,ll≠≠mm⊕
                                                                                                                                                            ⊕

     Dm == {w|
             {w| w∈S
                  w S}∪{w|
                        } {w| w∈S
                               w S}j} for
                                      for Roman
                                          Roman or
                                                or Jawi
                                                   Jawi                                             {wl| |wwl ∈TTa}}∪{
    D m              rr           j                                                                {w                { wwl |l |wwl l ∈TTe e}}l l≠≠a,a,l l≠≠ee⊕
                                                                                                                                                             ⊕
 Malay  text
 Malay text
                                                                                                      l    l    a

        De = {w| w J} for English text                                                             {w|l |wwl ∈TTm}}∪{
                                                                                                  {w               { wwl |l |wwl l ∈TT   } l ≠ m, l ≠ e
                                                                                                                                       e} l ≠ m, l ≠ e
                                                                                                    l     l    m                     e
       D    = {w|
        thus,
          e       = Da for
               D w∈J}   ⊕ D English
                               ⊕ Dtextimplies only an exact
                             m      e
       thus,   D  = D
        match with the  ⊕  D   ⊕  De implies
                      a samemlanguage
                                              only
                                       as query     an source
                                                or the  exact                         wherewhere  T is
                                                                                            T is the setthe set of translation
                                                                                                         of translation        of the keyword
                                                                                                                        of the keyword          wl
                                                                                                                                        wl in other
       match   with
        language.   the same language  as query or the source                        in other
                                                                                      languagelanguage
       language.
2117

Context based match C                                          on prophetic food later developed as mobile apps. The
                                                               knowledge sources are from al-Quran and Hadith which
                                                               are in Arabic, classical manuscript in Malay language
       C = {ws| ws ∈N} ∪ {wm| wm ∈O}                           either written in Roman or Jawi
                                                               which are mainly in English manage using a content
where ws are words matched in the synonym list, N and          management system (Norwawi et al. 2019). Results from
wm are words matched from the knowledge taxanomy, O.           NAISSE search engine on Quran were also compared to
                                                               https://quran.com, a search engine for Al-Quran as shown
     A CLIR application was developed called NAISSE
                                                               in Table 3.
(naisse.org). It is an online one-stop knowledge repository

                           TABLE 3. Comparison search results between quran.com and naisse.org

       Query        No. of verses   No. of verses    Similar      Synonym           Context             Spelling          Duplicate
   (Single word)      matched       matched from     verses       naisse.org       naisse.org       Variety (Arabic)       verse
                    from quran.      naisse.org                                                       naisse.org
                        com
 Kurma (Malay)            26             29            19              1                4                   4                  1

 Dates (English)          30             29            17              4                7                   15                 0

 ‫ليخنلا‬
 (Arabic with             1              29             1              5                4                   18                 1
 Diacritic)
 ‫بطر‬
 (Arabic without          2              3              1              0                0                   2                  0
 diacritic)

     Based on Table 3, the results show that naisse.           for example ‫ليخن‬, ‫بطر‬, ‫ف‬   َ ‫ت‬ِ ‫اًلي‬, ‫ص‬   ِ ‫ٌناَوْن‬, ‫قَن‬  ِ ‫ اًري‬and in
org is comparable to quran.com for Malay and English           context of meaning ‫ش‬        َ ِ‫ي‬
                                                                                  َ ‫ط ٍةَرَج‬  ّ ‫ب‬
                                                                                                َ ‫ٍة‬, ‫ق‬ ْ ‫ٍريِم‬, ‫ط‬
                                                                                                      ِ ‫ط‬          َ. ‫عْل‬
                                                                                                                        ٌ ‫ضَن‬ ِ ‫ ٌدي‬and
word except in Arabic. Upon examining and comparing            ‫ٍةَنيِل ْنِم‬.
the results, Arabic language source of ambiguity is due

to grammatical syntax rules. However, this could be                            CONCLUSION
overcome using translation, synonym, and context.              This study demonstrated the interconnection of
For example, if the query is Kurma which is in Malay
language, there are 6 exact matches with the Malay             such as Al-Quran, Hadith, Classical Manuscript, and
translation of the relevant Quranic verses, 1 synonym
which is Tamar and the rest are either the translation         a one-stop repository for easy access. A multilingual
into English and Arabic or context based on the synonym        knowledge repository, NAISSE, developed with three
or semantic as determined by expert. Query translation         major keyword matching mechanism: direct match,
using the parallel corpora is simple due to the translation    dictionary-based translation and context based through
already available for all the text in its respective           synonym or its semantic represented by the knowledge
documents as also recommended by Prasath et al. (2015)         taxonomy, structure of the keyword and resources
who use Tamil and English in their study. This study                          NAISSE will match the keyword search
will continue with further improvement in handling
                                                               with multi documents using a semantic technique for
the Arabic language as evident in Table 3. There are
                                                               its query translation based on the translation, synonym
varieties of terms of Arabic related to Kurma and Dates
2118

and semantic. It can be expandable to other languages by           Elayeb, B. & Bournas, I. 2016. Arabic cross-language information
developing the linguistic matching algorithm that suits                retrieval: a review. ACM Transactions on Asian and Low-
the language.                                                          Resource Language Information Processing (TALLIP) 15(3):
      There are lots of potential for the future works                 1-44.
                                                                   Jena, G.C. & Rautaray, S.S. 2019. A comprehensive survey on
since this is an initial study for MLIR using Arabic,
                                                                       cross-language information retrieval system. Indonesian
English, Malay including the Jawi text using data from
                                                                       Journal of Electrical Engineering and Computer Science
Quran, Hadith, classical manuscript and articles related               14(1): 127-134.
                                                                   Norwawi, N.M., Perumal, S., Sempo, M.W., Huda, E. & Jeng, W.
Future study may take advantage of the simple approach                 2019. Multi-lingual content management system for prophetic
                                                                       food. In Proceedings of the International Conference on
query translation with semi-automated generation of                    Islamic Applications in Computer Science and Technologies
                                                                       (IMAN 2019). 27(28).
terms of search results, it can be improved by focusing            Prasath, R., Sarkar, S. & O’Reilly, P. 2015. Improving cross
on reducing the language ambiguity through parallel                    language information retrieval using corpus based query
corpora based and handing the grammatical aspect of the                suggestion approach. In International Conference on
                                                                       Intelligent Text Processing and Computational Linguistics.
language especially Arabic.
                                                                       Springer, Cham. pp. 448-457.
                                                                   Sharma, M. & Morwal, S. 2015. A survey on cross-language
                   ACKNOWLEDGEMENTS
                                                                       information retrieval. International Journal of Advanced
The authors express their appreciation for the Niche                   Research in Computer and Communication Engineering
Research Grant Scheme, awarded by the Ministry of                      4(2): 384-387.
Education to Universiti Sains Islam Malaysia (USIM)                Tawil, S.F.M., Ismail, R., Wahid, F.A., Norwawi, N.M. & Mazlan,
with project code NRGS-P/FST/8404/52113.                               A.A. 2016. Application of OASys approaches for dates
                                                                       ontology. In Third International Conference on Information
                        REFERENCES                                     Retrieval and Knowledge Management (CAMP). IEEE. pp.
Abusalah, M., Tait, J. & Oakes, M. 2005. Literature review of          131-135.
   cross-language information retrieval. World Academy of
   Science, Engineering and Technology 4: 175-177.                 Faculty of Science and Technology
Agbele, K.K., Ayetiran, E.F. & Aruleba, K.D. 2018. Survey on       Universiti Sains Islam Malaysia
   cross-lingual information retrieval. International Journal of   71800 Nilai, Negeri Sembilan Darul Khusus
                                        9(8): 484-491.             Malaysia
Aldhlan, K.A., Zeki, A.M. & Zeki, A.M. 2010. Datamining and
   Islamic knowledge extraction: Alhadith as a knowledge           *Corresponding author; email: norita@usim.edu.my
   resource. In Proceedings of the International Conference
   on Information and Communication Technology for Muslim          Received: 23 January 2020
   World (ICT4M). IEEEE. H-21.                                     Accepted: 1 April 2020
Azad, H.K. & Deepak, A. 2019. Query expansion techniques for
   information retrieval: A survey. Information Processing &
   Management 56(5): 1698-1735.
You can also read