Query Translation for Multilingual Content with Semantic Technique - UKM
Sains Malaysiana 49(9)(2020): 2113-2118
http://dx.doi.org/10.17576/jsm-2020-4909-09

Query Translation for Multilingual Content with Semantic Technique
(Terjemahan Pertanyaan untuk Kandungan Pelbagai Bahasa dengan Teknik Semantik)

NORITA MD NORWAWI*, SUNDRESAN A/L PERUMAL, EMRAN HUDA & WAKA JENG

ABSTRACT

Cross-lingual information retrieval (CLIR) enables user queries to be posed in a language different from that of the target source materials. Thus, translation is the key element in query processing. There are three translation approaches: query, document, or hybrid query-document. However, query translation is very challenging due to the polysemy problem. The differing linguistic styles of multiple languages create ambiguity of meaning, so the user's true intention can be misinterpreted. This paper presents a semantic technique for query translation in a multilingual knowledge repository to improve query processing. Manually translated source documents, or parallel corpora, in English, Arabic, and Malay, including Jawi text, were used as the study data. A set of keywords related to prophetic food was identified by domain experts. These keywords were annotated with the relevant Quranic verses, Hadith texts, manuscript text images, and documents. Synonyms and contextual translations were also annotated to the related keywords. Each query uses three matching methods against the keyword index list, which links to the relevant documents. A one-stop knowledge repository on prophetic food was developed as a proof of concept, using sources reviewed by domain experts to ensure validity and integrity.

Keywords: Cross-lingual information retrieval; one-stop knowledge repository; prophetic food; query translation; semantic technique

INTRODUCTION

With the advancement of computer, Internet, and communication technology, information access has advanced into many tools and applications such as information retrieval (IR), question answering, summarization, multimedia information retrieval, text mining, clustering, and Web information retrieval. However, due to the diversity of information and language barriers, keeping up with communication and cultural interchange across the globe is now very challenging. As for IR, there are four categories of retrieval: mono-lingual, bi-lingual, cross-lingual (CL), and
multi-lingual (ML). Cross-language and multi-lingual IR use a query in one language to retrieve documents from another language. The query is posed in the source language, and the language signature of the relevant document is the target language. On the other hand, handling heterogeneous knowledge sources is challenging due to the diversity of perspectives and context of the knowledge. The level of complexity increases since these resources are in multiple languages.

CLIR and MLIR become more important due to the rapid progress of Internet technology, which opens the boundaries of language and culture. Hence, translation is the key task in query processing due to its multilingual nature. There are three main approaches to handling the translation: document, query, or hybrid query-document (Agbele et al. 2018), as shown in Table 1.

TABLE 1. Comparison of the three translation approaches (Agbele et al. 2018)

| SN | Parameter | Query Translation | Document Translation | Query-Document Translation |
|----|-----------|-------------------|----------------------|----------------------------|
| 1 | Extra storage space | Not needed | Needed | Not needed |
| 2 | Ambiguity | Maximum | Minimum | More than both query and document |
| 3 | Information retrieval | Bilingual | Bilingual | Bilingual and multilingual |
| 4 | Transition time | Less | More than query | More than both query and document |
| 5 | Flexibility | Highly | Less | Less |

TABLE 2. Comparison of the three translation approaches (Agbele et al. 2018; Jena & Rautaray 2019)

| Translation Approach | Dictionary Based | Corpora Based | Machine Translation Based |
|----------------------|------------------|---------------|---------------------------|
| Ambiguity | High | Low | Low |
| | Possible | Possible | Not possible |
| Working architecture | Visible, like white-box testing | Visible, like white-box testing | Works similar to black-box testing |
| Development expenses | Less expensive | More expensive than DBT | More expensive |
| Translation availability | Highly available in many languages | Available only in few languages | Available only in few languages |

Based on Table 1, Agbele et al. (2018) highlighted that the query-document translation approach does not require extra storage space but has issues related to language ambiguity and speed due to the synonymy and polysemy problems. Synonymy refers to multiple words with a common meaning, whereas polysemy refers to words with multiple meanings. Synonymous and polysemous words reduce the recall and precision rates (Azad & Deepak 2019). Linguistic ambiguity will lead to misinterpretation of the user's true intention, especially in translation. Hence, processing a query and its translation for CLIR and MLIR with heterogeneous knowledge resources, while ensuring the right context and diverse perspectives, is a challenge (Abusalah et al. 2005; Aldhlan et al. 2010; Elayeb & Bournas 2016; Sharma & Morwal 2015).

As for query translation, there are three sources of knowledge: dictionary, corpora, or machine learning. Table 2 compares these techniques (Agbele et al. 2018; Jena & Rautaray 2019). In this study, corpora-based query translation was used, which has lower ambiguity than the dictionary-based approach and a moderate development cost compared to machine learning based translation, as presented in Table 2.

This paper presents a semantic technique for query translation to handle the cross-lingual ambiguity of the diverse resources, where the query and the targeted text may be in Arabic, Malay, or English. Users may pose a query in English, Arabic, or Malay to search for relevant text in Al-Quran, Hadith, and classical Malay manuscripts in Roman or Jawi script related to prophetic food. Search results will embed the direct match with the translation in the other two languages, its synonym, or its contextual meaning. For example, kurma in Bahasa Malaysia could mean تمر, رطب, or نخل in Arabic.

MATERIALS AND METHODS

A knowledge repository was prepared consisting of Quranic verses, Hadith text, classical manuscript text, [...] Date and goat's milk were selected as the scope of the study. The work began with content analysis and a systematic literature review conducted by domain experts in Quran and Sunnah studies, manuscripts on Malay remedies, food science, and health sciences. All relevant [...] A taxonomy on dates and prophetic food was built based on the literature, concepts, and relations to the Quranic, [...] A structural connection was designed for the specialized search engine on prophetic food, which was later implemented as a web and mobile application called NAISSE.

RESULTS AND DISCUSSION

Based on the knowledge collected, a taxonomy of knowledge was constructed on dates and goat's milk, linking to the resources that map this knowledge [...] semantic technique of using keywords tagged with the translation, synonym, and context based on meaning as determined by experts. The Quranic, Hadith, and classical manuscript [...] and its details including abstracts. Figure 1 illustrates the representation of the knowledge repository.

[Figure 1 labels: Query; Keywords List; Direct Match; Translate; Context; Synonym & Polysemy; Multilingual Resources (Arabic, English, Malay); Result]
FIGURE 1. Schematic diagram of the multilingual knowledge repository
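The tagging described above, where a keyword carries its translations, synonyms, and contextual meanings, can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the `KeywordEntry` class, the `KEYWORDS` table, and the `expand_query` function are hypothetical names invented for illustration and are not taken from NAISSE. The sample values reuse the kurma example and the terms reported in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KeywordEntry:
    """One expert-tagged keyword (hypothetical structure, for illustration)."""
    term: str
    language: str
    translations: Dict[str, List[str]] = field(default_factory=dict)
    synonyms: List[str] = field(default_factory=list)
    context: List[str] = field(default_factory=list)


# Toy keyword list holding a single entry built from the paper's example.
KEYWORDS: Dict[str, KeywordEntry] = {
    "kurma": KeywordEntry(
        term="kurma",
        language="ms",
        translations={"ar": ["تمر", "رطب", "نخل"], "en": ["dates"]},
        synonyms=["tamar"],
        context=["شجرة طيبة"],
    )
}


def expand_query(query: str) -> Dict[str, List[str]]:
    """Return the tagged forms of a query term, grouped by match type."""
    entry = KEYWORDS.get(query.strip().lower())
    if entry is None:
        return {"direct": [], "translation": [], "synonym": [], "context": []}
    # Flatten the per-language translation lists into one list.
    translated = [w for words in entry.translations.values() for w in words]
    return {
        "direct": [entry.term],
        "translation": translated,
        "synonym": list(entry.synonyms),
        "context": list(entry.context),
    }
```

Calling `expand_query("Kurma")` returns the keyword itself plus its Arabic and English translations, its synonym, and its contextual term, each under a separate key so that a result list could show why a document matched.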
In order to resolve the linguistic ambiguity arising from query translation, a semantic technique was proposed as a mechanism that accepts the query in any of the three languages and then matches it to the relevant documents exactly, by translation, or by context. The keywords list contains possible keywords identified by experts as relevant to the scope and resources. Each keyword records the list of relevant IDs of the Quranic verses, Hadith text, and manuscript snippets [...] index of prophetic food names tagged with its translation and [...]

[Figure 2 labels: Word1 linked to Quran ID1 ... Quran IDn, Hadith ID1 ... Hadith IDn, Manuscript ID1 ... Manuscript IDn, and Article ID1 ... Article IDn; Word1 tagged with Word1 (English), Word1 (Arabic), Word1 (Malay), Scientific Names, Synonym, Context]
FIGURE 2. Illustration of keywords list structure linked to prophetic food list

Referring also to Figure 1, the query will be matched using direct match (D), dictionary-based match for the translation (T), and context-based match (C) through its synonyms (S) and semantics (M) in the knowledge taxonomy, O. The semantic technique is represented formally as follows. Assume $L$ is the set of all matching words, $w$, from all the resources: Quran ($Q$), Hadith ($H$), Malay manuscript in Roman text ($S_r$), Malay manuscript in Jawi text ($S_j$), and articles ($J$). Thus,

$$L = \{w \mid w \in D\} \cup \{w \mid w \in T\} \cup \{w \mid w \in C\}$$

where $L$ is the list of matched keywords, with $D$, $T$, and $C$ defined as follows.

Direct Match D

$$D_a = \{w \mid w \in Q\} \cup \{w \mid w \in H\} \quad \text{for Arabic text}$$

$$D_m = \{w \mid w \in S_r\} \cup \{w \mid w \in S_j\} \quad \text{for Roman or Jawi Malay text}$$

$$D_e = \{w \mid w \in J\} \quad \text{for English text}$$

Thus, $D = D_a \oplus D_m \oplus D_e$ implies only an exact match in the same language as the query or the source language.

Dictionary Based Translation Match T

For the dictionary-based translation match $T$:

$$T_m = \{w \mid w \in Q\} \cup \{w \mid w \in H\} \cup \{w \mid w \in J\} \quad \text{for a query in Malay language, } m$$

$$T_a = \{w \mid w \in S_r\} \cup \{w \mid w \in S_j\} \cup \{w \mid w \in J\} \quad \text{for a query in Arabic language, } a$$

$$T_e = \{w \mid w \in Q\} \cup \{w \mid w \in H\} \cup \{w \mid w \in S_r\} \cup \{w \mid w \in S_j\} \quad \text{for a query in English language, } e$$

and

$$T = \big(\{w_l \mid w_l \in T_a\} \cup \{w_l \mid w_l \in T_m\},\ l \neq a,\ l \neq m\big) \oplus \big(\{w_l \mid w_l \in T_a\} \cup \{w_l \mid w_l \in T_e\},\ l \neq a,\ l \neq e\big) \oplus \big(\{w_l \mid w_l \in T_m\} \cup \{w_l \mid w_l \in T_e\},\ l \neq m,\ l \neq e\big)$$

where $T$ is the set of translations of the keyword $w_l$ in the other languages.
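To make the set definitions of D and T above more concrete, the following is a minimal sketch in Python. It is not the NAISSE implementation. The resource contents, the `TRANSLATIONS` table, and all function names are assumptions made for illustration, and the Arabic words are written as transliterations (tamr, rutab, nakhl) so that the example stays in ASCII.

```python
from typing import Dict, Set, Tuple

# Toy word sets per resource (illustrative only, not NAISSE data).
RESOURCES: Dict[str, Set[str]] = {
    "Q": {"rutab", "nakhl"},  # Quran (Arabic, transliterated here)
    "H": {"tamr"},            # Hadith (Arabic, transliterated here)
    "Sr": {"kurma"},          # Malay manuscript, Roman text
    "Sj": set(),              # Malay manuscript, Jawi text (left empty here)
    "J": {"dates"},           # articles (English)
}

# Resources that feed D_l and T_l for a query language l in {a, m, e},
# following the set definitions in the text.
DIRECT: Dict[str, Tuple[str, ...]] = {
    "a": ("Q", "H"),
    "m": ("Sr", "Sj"),
    "e": ("J",),
}
TRANSLATION: Dict[str, Tuple[str, ...]] = {
    "a": ("Sr", "Sj", "J"),
    "m": ("Q", "H", "J"),
    "e": ("Q", "H", "Sr", "Sj"),
}

# Hypothetical expert-tagged translations of each keyword.
TRANSLATIONS: Dict[str, Set[str]] = {
    "kurma": {"tamr", "rutab", "nakhl", "dates"},
    "dates": {"kurma", "tamr", "rutab", "nakhl"},
}


def words_in(names: Tuple[str, ...]) -> Set[str]:
    """Union of the word sets of the named resources."""
    found: Set[str] = set()
    for name in names:
        found |= RESOURCES[name]
    return found


def direct_match(query: str, lang: str) -> Set[str]:
    """D: exact match in the resources written in the query language."""
    return {query} & words_in(DIRECT[lang])


def translation_match(query: str, lang: str) -> Set[str]:
    """T: tagged translations found in the other-language resources."""
    return TRANSLATIONS.get(query, set()) & words_in(TRANSLATION[lang])


def matched_keywords(query: str, lang: str) -> Set[str]:
    """L restricted to D and T; the context match C is omitted in this sketch."""
    return direct_match(query, lang) | translation_match(query, lang)
```

In this toy setup a Malay query such as `matched_keywords("kurma", "m")` combines the exact hit in the Roman-script manuscript set with the translated hits in the Quran, Hadith, and article sets. Keeping the language-to-resource mapping in two small tables mirrors the way the formal definitions assign resources to each query language.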
Context Based Match C

$$C = \{w_s \mid w_s \in N\} \cup \{w_m \mid w_m \in O\}$$

where $w_s$ are words matched in the synonym list, $N$, and $w_m$ are words matched from the knowledge taxonomy, $O$.

A CLIR application called NAISSE (naisse.org) was developed. It is an online one-stop knowledge repository on prophetic food, later developed as mobile apps. The knowledge sources are al-Quran and Hadith, which are in Arabic; classical manuscripts in the Malay language, written in either Roman or Jawi; and articles, which are mainly in English. They are managed using a content management system (Norwawi et al. 2019). Results from the NAISSE search engine on the Quran were also compared to https://quran.com, a search engine for Al-Quran, as shown in Table 3.

TABLE 3. Comparison of search results between quran.com and naisse.org

| Query (single word) | No. of verses matched from quran.com | No. of verses matched from naisse.org | Similar verses | Synonym (naisse.org) | Context (naisse.org) | Spelling variety (Arabic) | Duplicate verse |
|---|---|---|---|---|---|---|---|
| Kurma (Malay) | 26 | 29 | 19 | 1 | 4 | 4 | 1 |
| Dates (English) | 30 | 29 | 17 | 4 | 7 | 15 | 0 |
| النخيل (Arabic with diacritic) | 1 | 29 | 1 | 5 | 4 | 18 | 1 |
| رطب (Arabic without diacritic) | 2 | 3 | 1 | 0 | 0 | 2 | 0 |

Based on Table 3, the results show that naisse.org is comparable to quran.com for Malay and English words but not for Arabic. Upon examining and comparing the results, the source of ambiguity in the Arabic language is its grammatical syntax rules. However, this could be overcome using translation, synonym, and context. For example, if the query is Kurma, which is in Malay, there are 6 exact matches with the Malay translation of the relevant Quranic verses and 1 synonym, which is Tamar; the rest are either the translation into English and Arabic or context matches based on the synonym or semantics as determined by experts. Query translation using parallel corpora is simple because the translation is already available for all the text in its respective documents, as also recommended by Prasath et al. (2015), who used Tamil and English in their study. This study will continue with further improvement in handling the Arabic language, as evident in Table 3. There are varieties of Arabic terms related to Kurma and Dates, for example نخيل, رطب, فتيلا, صنوان, نقيرا, and in context of meaning شجرة طيبة, قطمير, طلع نضيد, and the word من لينة.

CONCLUSION

This study demonstrated the interconnection of knowledge sources such as Al-Quran, Hadith, classical manuscripts, and articles in a one-stop repository for easy access. A multilingual knowledge repository, NAISSE, was developed with three major keyword matching mechanisms: direct match, dictionary-based translation, and context-based match through synonyms or semantics represented by the knowledge taxonomy and the structure of the keywords and resources. NAISSE matches the keyword search with multiple documents using a semantic technique for its query translation based on the translation, synonym, and semantics. It can be expanded to other languages by developing a linguistic matching algorithm that suits the language.

There is considerable potential for future work, since this is an initial study of MLIR using Arabic, English, and Malay, including Jawi text, with data from the Quran, Hadith, classical manuscripts, and articles related to prophetic food. Future study may take advantage of the simple approach to query translation with semi-automated generation of [...] In terms of search results, it can be improved by focusing on reducing the language ambiguity through the parallel corpora-based approach and handling the grammatical aspects of the language, especially Arabic.

ACKNOWLEDGEMENTS

The authors express their appreciation for the Niche Research Grant Scheme, awarded by the Ministry of Education to Universiti Sains Islam Malaysia (USIM) with project code NRGS-P/FST/8404/52113.

REFERENCES

Abusalah, M., Tait, J. & Oakes, M. 2005. Literature review of cross-language information retrieval. World Academy of Science, Engineering and Technology 4: 175-177.

Agbele, K.K., Ayetiran, E.F. & Aruleba, K.D. 2018. Survey on cross-lingual information retrieval. International Journal of [...] 9(8): 484-491.

Aldhlan, K.A., Zeki, A.M. & Zeki, A.M. 2010. Datamining and Islamic knowledge extraction: Alhadith as a knowledge resource. In Proceedings of the International Conference on Information and Communication Technology for Muslim World (ICT4M). IEEE. H-21.

Azad, H.K. & Deepak, A. 2019. Query expansion techniques for information retrieval: A survey. Information Processing & Management 56(5): 1698-1735.

Elayeb, B. & Bournas, I. 2016. Arabic cross-language information retrieval: A review. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 15(3): 1-44.

Jena, G.C. & Rautaray, S.S. 2019. A comprehensive survey on cross-language information retrieval system. Indonesian Journal of Electrical Engineering and Computer Science 14(1): 127-134.

Norwawi, N.M., Perumal, S., Sempo, M.W., Huda, E. & Jeng, W. 2019. Multi-lingual content management system for prophetic food. In Proceedings of the International Conference on Islamic Applications in Computer Science and Technologies (IMAN 2019). 27(28).

Prasath, R., Sarkar, S. & O'Reilly, P. 2015. Improving cross language information retrieval using corpus based query suggestion approach. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, Cham. pp. 448-457.

Sharma, M. & Morwal, S. 2015. A survey on cross-language information retrieval. International Journal of Advanced Research in Computer and Communication Engineering 4(2): 384-387.

Tawil, S.F.M., Ismail, R., Wahid, F.A., Norwawi, N.M. & Mazlan, A.A. 2016. Application of OASys approaches for dates ontology. In Third International Conference on Information Retrieval and Knowledge Management (CAMP). IEEE. pp. 131-135.

Faculty of Science and Technology
Universiti Sains Islam Malaysia
71800 Nilai, Negeri Sembilan Darul Khusus
Malaysia

*Corresponding author; email: norita@usim.edu.my

Received: 23 January 2020
Accepted: 1 April 2020