International Journal of Advanced Science and Technology, Vol. 29, No. 3 (2020), pp. 9478-9486

Formula Based Regional Dialect Identification of Telugu Language Using LDA

Ongole Gandhi1, Sajida Sultana. Sk2, Shivaprasad. S3
1, 2, 3 Assistant Professor, Department of CSE, VFSTR University, Guntur
1 ongolegandhi@gmail.com, 2 sajidashaik550@gmail.com, 3 shiva.prasad923@gmail.com

Abstract

A dialect is a variety of a language tied to a region, reflected in how the local people use it. Telugu is a standard and historical language with four dialects: Coastal Andhra slang, Mid Andhra Pradesh slang, Rayalaseema slang, and Telangana slang. For these dialects we created three databases: Andhra (Coastal Andhra slang + Mid Andhra Pradesh slang), Rayalaseema slang, and Telangana slang. No standard database exists, in either speech or text form, for identifying the regional dialects, and this is the main reason for the scarcity of research on Telugu dialects. In this work we created a standard text database for identifying the dialects of Telugu and digitized the data set for pattern recognition; for this we used Anu Script Manager to give the text a base in the form of a formula. We applied the Linear Discriminant Analysis (LDA) pattern recognition algorithm to identify the pattern that indicates the dialect to which a word belongs.

Keywords: Dialects, Anu Script Manager, Pattern recognition, Linear Discriminant Analysis.

1. Introduction

Language can be viewed as a system of conventional statements used to communicate, formally or informally, with familiar or unfamiliar people in order to express thoughts, beliefs, or anything else that crosses the mind. Estimates of the number of human languages in the world vary between 8,000 and 9,000.
However, any precise estimate depends on the arbitrary, and in origin Western, distinction between languages and dialects. Natural languages are used for communication and can be encoded in visual or tactile media, for example in writing, whistling, signing, or braille. This is one of the main grounds for saying that human language is modality-independent. All languages depend on the process of semiosis, which relates signs to particular meanings [4]. Oral, manual, and tactile languages contain a phonological system that governs how symbols are used to form sequences known as words or morphemes, and a syntactic system that governs how words and morphemes are combined to form phrases and utterances. Human language has the properties of productivity and displacement, and relies entirely on social convention and learning. Its complex structure affords a much wider range of expressions than any known system of animal communication. People acquire the language of the region in which they are born [3].

ISSN: 2005-4238 IJAST 9478 Copyright ⓒ 2020 SERSC

A group of languages that descend from a common ancestor is referred to as a language family. The Indo-European family [9] has 437 daughter languages and about 2.910 billion speakers, nearly half of the global population. Its widely used languages include English, Spanish, French, German, Russian, Punjabi, Bengali, and Hindustani, and several other languages are spoken all over the Pacific [19]. The Dravidian family, which includes Tamil, Telugu, and Kannada, is spoken in most parts of Southern India. The scholarly consensus is that more than 90% of the languages spoken at the start of this century will probably be extinct by 2100 [3].

The term dialect is used in two distinct ways to refer to two different kinds of linguistic phenomena. In the first usage, a dialect associated with a particular social class can be termed a sociolect, a dialect associated with a particular ethnic group can be termed an ethnolect, and a geographical or regional dialect can be termed a regiolect (alternative terms include 'regionalect', 'geolect', and 'topolect'). Under this definition, any variety of a given language can be classified as "a dialect", including any standardized varieties. In this case, the distinction between the "standard language" (that is, the "standard" dialect of a particular language) and the "nonstandard" (vernacular) dialects of the same language is often arbitrary and based on social, political, cultural, or historical considerations [23].
In the second usage, the standardized language would not itself be regarded as a "dialect", as it is the dominant language in a particular state or region, whether in terms of linguistic prestige, social or political (for example official) status, predominance or prevalence, or all of the above. Meanwhile, under this usage, the "dialects" are generally not variations of the dominant language but rather separate (though often loosely related) languages in their own right. Therefore, these "dialects" are not dialects or varieties of a particular language in the same sense as in the first usage [9][10]. Although they may share roots in the same subfamily as the dominant language and may even, to varying degrees, share some mutual intelligibility with the standardized language, they often did not evolve closely with the standard language or within the same linguistic subgroup or speech community, and may instead better fit various parties' criteria for a separate language. The term "dialect" used [7][8] in this way carries a political connotation, being mostly used to refer to low-prestige languages (regardless of their actual degree of distance from the national language), languages lacking institutional support, or those perceived as unsuitable for writing. The designation "dialect" is also used popularly to refer to the unwritten or non-codified languages of developing countries or isolated areas, where the term "vernacular language" would be preferred by linguists [10].

2. Proposed Methodology

In research on Indian regional language identification, regional dialects have received comparatively little attention.
Therefore, the main objective of our research is to identify Indian regional dialects. With this objective in mind, we considered the dialects of the Telugu language [1]. The four regional dialects of Telugu are:

1. Rayalaseema dialect: spoken in 4 districts, with a population of nearly 1.52 crore as per the 2011 survey.
2. Telangana dialect: spoken in 31 districts, with a population of nearly 3.52 crore as per the 2011 survey.
3. Kostha Andhra dialect: spoken in the Krishna, Guntur, and Godavari districts.
4. Kalinga Andhra dialect: spoken in major districts such as Visakhapatnam, Vijayanagaram, and Srikakulam.

As the Kostha Andhra and Kalinga Andhra dialects are closely related, we treated them as a single dialect. There are different theories and conclusions about how these dialects arose. The reasons cited range from the language of the rulers of a region to the caste system of that region and the settlement of the people who live there. In the case of the Telangana dialect, there is a significant influence of the Tamil and Kannada languages along with a share of Urdu, which has made the dialect more popular. For Rayalaseema, we have to consider historical and geographical reasons and the significant influence of the Tamil and Kannada languages. In the case of Kostha Andhra there is a significant influence of Sanskrit and English. Considering these scenarios, we classified the major dialects based on their local prominence in the Telugu-speaking states.

Telangana is the 12th most populous state in India and the 11th largest by area, which makes it one of the larger states. It has a rich folk music culture. It is located in the south-central stretch of the Indian peninsula. Its dialect is a well-known informal spoken variety of Telugu, and this slang is widely used in movies.
According to the survey conducted in India on 1 May 2011, the dialects have approximately 35.2 million speakers (Telangana), 34.5 million speakers (Kostha Andhra), and 15.2 million speakers (Rayalaseema). Until now there has been no substantial dialect database for the Telugu language, and this unavailability is a major reason for the lack of research. For this purpose, in this paper we built dialect databases for the Andhra dialect (Coastal Andhra slang + Mid Andhra Pradesh slang), the Rayalaseema dialect, and the Telangana dialect, which are essential for identifying the region from words and from sentences.

2.1 Anu Script Manager

Anu Script Manager is a well-known software package that can be used to write in more than 3 languages with a base that can be used for applying pattern recognition, and it played a pivotal role in our practical testing. It supports languages such as Hindi, Devnagari, Telugu, Tamil, Kannada, and Malayalam. Supported applications: MS Word, Photoshop, PageMaker, Corel, and others (almost all Windows applications).

Pattern recognition is the automated recognition of patterns and regularities in data. The related terms are as follows: machine learning is one approach to pattern recognition, while other approaches include hand-crafted rules or heuristics; and pattern recognition is one approach to artificial intelligence, while
other approaches include symbolic artificial intelligence.

Fig. 1 shows the usage of Anu Script Manager.

2.2 Linear Discriminant Analysis

LDA is similar to PCA, but it focuses on maximizing the separability among known categories. For example, to reduce a 2D plot to a 1D plot while maximizing the separability of two categories, LDA uses the information from both axes to create a new axis and projects the data onto it in a way that maximizes the separation of the two categories. Two criteria are considered simultaneously: the first is to maximize the distance between the two category means, and the second is to minimize the variation within each category, which LDA calls scatter. The new axis is the one that maximizes the ratio of the squared difference between the two means to the sum of the scatters, (μ1 − μ2)² / (s1² + s2²), where μ1 and μ2 are the projected means and s1² and s2² are the projected scatters of the two categories.

The methodology follows the two basic stages of every recognition system: training and testing. In the training stage the working system is developed, whereas in the testing stage the system created during training is evaluated. In this paper, Section 2.3 describes database creation, Section 2.4 describes database digitalization, and Section 2.5 describes the methodology.

2.3. Database Creation

An effective identification system needs data sets. These data sets have to be collected on a large scale, since the word-based classification of the proposed method requires a large database. Because dialect vocabulary differs significantly from region to region, we divided the data sets by region and gathered them from different people, so as to obtain words from as many speakers familiar with each region as we could find.
We thus created three dialect databases: the Andhra dialect database, the Telangana dialect database, and the Rayalaseema dialect database. For the Andhra database we approached nearly 50-60 people and asked them to say commonly used words in their native slang. We filtered these words by cross-verifying them (removing words repeated between the data sets collected from each person) and built the complete database by combining all the filtered data sets. For the Telangana database we approached nearly 80-90 people and followed the same procedure: the commonly used native slang words they gave were cross-verified, repeated words were removed, and the filtered data sets were combined into one database. For the Rayalaseema database we approached nearly 20-30 people and again followed the same procedure. Fig. 1 shows the complete process applied while creating the database and Table 1 shows the complete database.

Dialect          No. of sets gathered   Total number of words
Andhra Pradesh   56                     690
Telangana        86                     1578
Rayalaseema      25                     270

Table 1: Complete dataset

2.4. Database Digitalization

We built three dialect databases (Andhra dialect, Rayalaseema dialect, and Telangana dialect) by gathering multiple data sets from different people belonging to the same region. The problem is that text produced with an online text converter has no base value, so it cannot be used in the identification process. We therefore used the Anu Script Manager software, which is widely used for regional-language typing in common applications and which provides a formula-based representation that is useful for identification. We digitized the databases obtained by filtering the different data sets, 2,538 words in total.

2.5. Methodology

(a) Training: Since no region-specific dialect database existed until now, we had to create a large database.
Because dialect vocabulary differs significantly from region to region, we divided the data sets by region and gathered them from different people, so as to obtain words from as many speakers familiar with each region as we could find. We thus created three dialect databases: the Andhra dialect database, the Telangana dialect database, and the Rayalaseema dialect database. We talked to approximately 10-15 people on average and stored the words from each person as a set; the sets were then filtered, grouped by region, and digitized into the database. After successful completion of the database we moved on to sentence pattern recognition, for which we took a considerable number of files containing sentences that belong to the different slangs.
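The training steps above (store each person's words as a set, remove repeats, group by region) can be sketched in a few lines of Python, together with the word lookup that sentence-level recognition relies on. This is only a minimal sketch under stated assumptions, not the authors' code: the function names `build_database` and `classify_sentence` are hypothetical, the tokens are placeholders rather than real Telugu words, and the sentence step is a plain majority vote over word matches, not LDA itself.

```python
from collections import Counter


def build_database(sets_by_region):
    """Merge the per-person word sets of each region into one de-duplicated set."""
    database = {}
    for region, person_sets in sets_by_region.items():
        merged = set()
        for words in person_sets:
            # Trim whitespace and drop empty entries before merging.
            merged.update(w.strip() for w in words if w.strip())
        database[region] = merged
    return database


def classify_sentence(sentence, database):
    """Return the region whose word set matches the most tokens, or None."""
    votes = Counter()
    for token in sentence.split():
        for region, words in database.items():
            if token in words:
                votes[region] += 1
    if not votes:
        return None  # no token was found in any regional database
    return votes.most_common(1)[0][0]


# Placeholder tokens only (assumption): these are NOT real dialect words.
collected_sets = {
    "Andhra": [["a1", "a2"], ["a2", "a3"]],
    "Telangana": [["t1", "t2"], ["t2", "t3"]],
    "Rayalaseema": [["r1"], ["r1", "r2"]],
}

database = build_database(collected_sets)
print({region: len(words) for region, words in database.items()})
print(classify_sentence("t1 x t3 a1", database))
```

With these placeholder sets, each region keeps only its distinct words, and the sample sentence is assigned to Telangana because two of its tokens match that set against one for Andhra. A word that occurs in more than one regional set votes for each of them, so a real system would need a rule for such ties.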
Fig. 2 Training phase

(b) Testing: During the testing phase we used Python to implement the Linear Discriminant Analysis pattern recognition algorithm. We built code that takes either a single raw input or a series of raw inputs from the user and checks the database for similar entries. For example, when an input is given to the system, it checks the data in the database for similarity and, if a match exists, returns the matching value or values. After taking multiple inputs, it classifies each raw input according to the data sets and writes the result in a specific format; for instance, a series of inputs is divided by region. In this way the code carries out the linear discriminant pattern recognition. For sentences, we first preprocess the sentence by splitting it into words, and the rest of the execution is the same. We used Python on Excel sheets to create the data and to implement the pattern recognition.

Fig. 3 Testing process

3. Results

S.No.   Dialect       No. of words   No. of sentences in testing
1       Telangana     690            18
2       Rayalaseema   1578           16
3       Andhra        270            17

Table 2: Statistical description of the database used for dialect identification

Method   Dialect       No. of testing sentences   No. of sentences correctly identified   Accuracy
LDA      Telangana     18                         16                                      88.8%
LDA      Rayalaseema   16                         13                                      81.25%
LDA      Andhra        17                         14                                      82.35%

Table 3: Summary of sentence identification accuracy

Fig. 4 Testing samples and identified sentences

Fig. 5 Accuracy representation

4. Conclusion & Future Scope

In this work we performed sentence-based dialect identification of the Telugu language using an LDA model. We used Anu Script Manager to digitize the data set. The main reason for the scarcity of research on dialect identification is the lack of a database. We therefore created a data set collected from different places in the Telangana, Rayalaseema, and Kostha Andhra regions. We trained the LDA method and tested it to identify which slang a sentence belongs to. LDA gives good accuracy: 88.8% for Telangana, 81.25% for Rayalaseema, and 82.35% for Andhra. In future, the database can be enlarged and different machine learning models can be applied to identify the dialects.
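The accuracy figures can be re-derived from the sentence counts in Table 3. The short Python sketch below does only that arithmetic; the names `table3` and `accuracy_percent` are illustrative, and the pooled figure over all test sentences is computed here from the table and is not a result reported by the paper.

```python
# (correct, total) sentence counts copied from Table 3.
table3 = {
    "Telangana": (16, 18),
    "Rayalaseema": (13, 16),
    "Andhra": (14, 17),
}


def accuracy_percent(correct, total):
    """Share of correctly identified test sentences, in percent."""
    return 100.0 * correct / total


per_dialect = {name: accuracy_percent(c, t) for name, (c, t) in table3.items()}

# Pooled accuracy over all test sentences (derived here, not reported in the paper).
total_correct = sum(c for c, _ in table3.values())
total_sentences = sum(t for _, t in table3.values())
overall = accuracy_percent(total_correct, total_sentences)

for name, value in per_dialect.items():
    print(f"{name}: {value:.2f}%")
print(f"Pooled: {overall:.2f}% ({total_correct}/{total_sentences})")
```

To two decimals, 16 of 18 is 88.89%, so the 88.8% quoted for Telangana appears to be a truncated value; the Rayalaseema and Andhra figures match the table.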
References

[1]. Shivapasad Satla, M. Sadanandam, "Identification of regional dialects of Telugu language using text independent speech processing models", International Journal of Speech Technology, Feb 2020. DOI: 10.1007/s10772-020-09678-y
[2]. Al-Walaie, M. A., & Khan, M. B. (2017). Arabic dialects classification using text mining techniques. In International conference on computer and applications (ICCA).
[3]. Bailey, C. N. (1968). Is there a midland dialect? Washington, D.C.: ERIC Clearinghouse.
[4]. Balleda, J., Murthy, H. A., & Nagarajan, T. (2000). Language identification from short segments of speech. In Proceedings of the INTERSPEECH (pp. 1033–1036).
[5]. Chen, M., Wang, L., & Xu, C.-Z. (2017). A novel approach of system design for dialect speech interaction with NAO robot. In 18th international conference on advanced robotics (ICAR).
[6]. Chittaragi, N. B., & Koolagudi, S. G. (2017). Acoustic features based word level dialect classification using SVM and ensemble methods. In IC3, Noida, 10–12 August 2017.
[7]. Xin Guo, Yang Xiang, Qian Chen, "LDA-based online topic detection using tensor factorization", Journal of Information Science, March 8, 2013. https://doi.org/10.1177/0165551512473066
[8]. Aytug Onan, Hasan Bulut, Serdar Korukoglu, "An improved ant algorithm with LDA-based representation for text document clustering", Journal of Information Science, March 22, 2016. doi/10.1177/0165551516638784
[9]. P. Anupriya, S. Karpagavalli, "LDA based topic modeling", 2015 International Conference on Advanced Computing and Communication Systems, 5-7 Jan. 2015.
[10]. Ayoub Bagheri, Mohamad Saraee, Franciska de Jong, "ADM-LDA: An aspect detection model based on topic modelling using the structure of review sentences", June 11, 2014. https://doi.org/10.1177/0165551514538744
[11]. Liu, B, Zhang, L. A survey of opinion mining and sentiment analysis. In: Mining text data, 1st edn. New York: Springer, 2012, pp. 415–463.
[12]. Ku, LW, Chen, HH. Mining opinions from the Web: beyond relevance retrieval. Journal of the American Society for Information Science and Technology 2007; 58(12): 1838–1850.
[13]. Me. Jakeera Begum and M. Venkata Rao (2015), "Collaborative Tagging Using CAPTCHA", International Journal of Innovative Technology And Research, Volume No. 3, Issue No. 5, pp. 2436–2439.
[14]. L. Jagajeevan Rao, M. Venkata Rao, T. Vijaya Saradhi (2016), "How The Smartcard Makes the Certification Verification Easy", Journal of Theoretical and Applied Information Technology, Vol. 83, No. 2, pp. 180-186.
[15]. Venkata Rao Maddumala, R. Arunkumar, and S. Arivalagan (2018), "An Empirical Review on Data Feature Selection and Big Data Clustering", Asian Journal of Computer Science and Technology, Vol. 7, No. S1, pp. 96-100.
[16]. Singamaneni Kranthi Kumar, Pallela Dileep Kumar Reddy, Gajula Ramesh, Venkata Rao Maddumala (2019), "Image Transformation Technique Using Steganography Methods Using LWT Technique", Traitement du Signal, Vol. 36, No. 3, pp. 233-237.
[17]. Banavathu Mounika, Sk. Reshmi Khadherbhi, Venkata Rao Maddumala, R.S.M Lakshmi (2020), "Data Distribution Method With Text Extraction From Big Data", Journal of Critical Reviews, Vol 7, Issue 6, 2020, pp. 376-380.
[18]. V. Lakshman Narayana, B. Naga Sudheer, Venkata Rao Maddumala, P. Anusha (2020), "Fuzzy Base Artificial Neural Network Model For Text Extraction From Images", Journal of Critical Reviews, Vol 7, Issue 6, 2020, pp. 350-354.
[19]. R.S.M. Lakshmi Patibandla, B. Tarakeswara Rao, P. Sandhya Krishna, Venkata Rao Maddumala, "Medical Data Clustering Using Particle Swarm Optimization Method", Journal of Critical Reviews, Vol 7, Issue 6, 2020, pp. 363-367.
[20]. J. Tejaswini, T. Mohana Kavya, R. Devi Naga Ramya, P. Sai Triveni, Venkata Rao Maddumala (2020), "Accurate Loan Approval Prediction Based On Machine Learning Approach", Journal of Engineering Science, Vol 11, Issue 4, April 2020, pp. 523-532.
[21]. K. Yamini, K. Sai Swetha, P. Lakshmi Prasanna, M. Rupa Venkata Swathi, Venkata Rao Maddumala (2020), "Image Colorization With Deep Convolutional Open CV", Journal of Engineering Science, Vol 11, Issue 4, April 2020, pp. 533-543.
[22]. Arepalli, Gopi & Erukula, Suresh & Gopi, A.P. & Nagaraju, Chiluka. (2016). Secure multicast routing protocol in MANETs using efficient ECGDH algorithm. International Journal of Electrical and Computer Engineering (IJECE). 6. 1857-1865. 10.11591/ijece.v6i4.9941.
[23]. K. Sarada, V. Lakshman Narayana (2020), "Improving Relevant Text Extraction Accuracy using Clustering Methods", TEST Engineering and Management, Volume 83, pp. 15212–15219.
[24]. K. Sarada, V. Lakshman Narayana (2020), "An Iterative Group Based Anomaly Detection Method For Secure Data Communication in Networks", Journal of Critical Reviews, Vol 7, Issue 6, pp. 208-212. doi: 10.31838/jcr.07.06.39
[25]. Banavathu Mounika, P. Anusha, V. Lakshman Narayana (2020), "Use of BlockChain Technology In Providing Security During Data Sharing", Journal of Critical Reviews, Vol 7, Issue 6, pp. 338-343. doi: 10.31838/jcr.07.06.59
[26]. V. Lakshman Narayana, B. Naga Sudheer (2020), "Fuzzy Base Artificial Neural Network Model For Text Extraction From Images", Journal of Critical Reviews, Vol 7, Issue 6, pp. 350-354. doi: 10.31838/jcr.07.06.61
[27]. V. Lakshman Narayana, A. Peda Gopi (2020), "Accurate Identification And Detection Of Outliers In Networks Using Group Random Forest Methodoly", Journal of Critical Reviews, Vol 7, Issue 6, pp. 381-384. doi: 10.31838/jcr.07.06.67
[28]. Sandhya Pasala, V. Pavani, G. Vidya Lakshmi, V. Lakshman Narayana (2020), "Identification Of Attackers Using Blockchain Transactions Using Cryptography Methods", Journal of Critical Reviews, Vol 7, Issue 6, pp. 368-375. doi: 10.31838/jcr.07.06.65
[29]. C.R. Bharathi, Vejendla Lakshman Narayana, L.V. Ramesh (2020), "Secure Data Communication Using Internet of Things", International Journal of Scientific & Technology Research, Volume 9, Issue 04, pp. 3516-3520.