International Journal of Advanced Science and Technology
                                                              Vol. 29, No. 3, (2020), pp. 9478 - 9486

Formula Based Regional Dialect Identification of Telugu language
                         Using LDA

                Ongole Gandhi1, Sajida Sultana. Sk2, Shivaprasad. S3
    1, 2, 3 Assistant Professor, Department of CSE, VFSTR University, Guntur
  1ongolegandhi@gmail.com, 2sajidashaik550@gmail.com, 3shiva.prasad923@gmail.com

                                         Abstract
 A dialect is a variety of a language that is tied to a region and to the particular way
local people speak it. Telugu is a standard and historical language with four dialects:
Coastal Andhra slang, Mid Andhra Pradesh slang, Rayalaseema slang and Telangana slang.
For these dialects we created three databases: Andhra (Coastal Andhra slang + Mid Andhra
Pradesh slang), Rayalaseema slang and Telangana slang. No standard database exists, in
either speech or text format, for identifying the regional dialects, and this is the main
reason for the limited research on Telugu dialects. In this work we created a standard
text database for identifying the dialects of the Telugu language and digitalized the
dataset for pattern recognition. For this we used Anu-Script Manager, which gives the text
a base in the form of a formula. We applied the Linear Discriminant Analysis pattern
recognition algorithm to find the patterns that identify the dialect to which a word
belongs.

  Keywords: Dialects, Anu-script manager, Pattern recognition, Linear Discriminant
Analysis.

    1. Introduction

Language is a structured system of communication that people use, with familiar and
unfamiliar listeners alike, to express thoughts, beliefs or anything else they can
imagine. Estimates of the number of human languages in the world vary between 8,000 and
9,000. Any precise figure, however, depends on the somewhat arbitrary distinction between
languages and dialects. Natural languages are spoken or signed, but any language can also
be encoded in visual or tactile media, for example in writing, whistling, signing or
braille. This is because human language is modality-independent. All languages rely on
the process of semiosis, which relates signs to particular meanings [4]. Oral, manual and
tactile languages contain a phonological system that governs how symbols are used to form
sequences known as words or morphemes, and a syntactic system that governs how words and
morphemes are combined to form phrases and utterances. Language has the properties of
productivity and displacement, and it relies entirely on social convention and learning.
Its complex structure affords a much wider range of expressions than any known system of
animal communication. Language is acquired by the human brain

  ISSN: 2005-4238 IJAST                                                                        9478
  Copyright ⓒ 2020 SERSC
from birth, shaped by the region in which the speaker is born [3]. A group of languages
that descend from a common ancestor is known as a language family. The Indo-European
family [9] has about 2.910 billion speakers, nearly half of the global population, and
437 daughter languages. It includes widely spoken languages such as English, Spanish,
French, German, Russian, Punjabi, Bengali and Hindustani, and several other languages
are spoken across the Pacific [19]. The Dravidian family of languages is spoken in most
parts of southern India and includes Tamil, Telugu and Kannada. Scholars generally agree
that most of the languages spoken at the start of this century, by some estimates as many
as 90%, will probably be extinct by 2100 [3]. The term dialect is used in two distinct
ways to refer to two different types of linguistic phenomena:

In the first usage, a dialect that is associated with a particular social class can be
termed a sociolect, a dialect that is associated with a particular ethnic group can be
termed an ethnolect, and a geographical/regional dialect may be termed a regiolect
(alternative terms include 'regionalect', 'geolect' and 'topolect'). According to this
definition, any variety of a given language can be classified as "a dialect", including
any standardized varieties. In this case, the distinction between the "standard language"
(i.e. the "standard" dialect of a particular language) and the "nonstandard" (vernacular)
dialects of the same language is often arbitrary and based on social, political, cultural
or historical considerations [23].

In the second usage, unlike the first, the standardized language would not itself be
considered a "dialect", as it is the dominant language in a particular state or region,
whether in terms of linguistic prestige, social or political (e.g. official) status,
predominance or prevalence, or all of the above. Under this usage, the "dialects" are for
the most part not variations of the dominant language but rather separate (though often
loosely related) languages in themselves. These "dialects" are therefore not dialects or
varieties of a particular language in the same sense as in the first usage [9][10].
Although they may share roots in the same subfamily as the dominant language and may
even, to varying degrees, share some mutual intelligibility with the standardized
language, they often did not evolve closely with the standard language or within the same
linguistic subgroup or speech community, and may better fit other groups' criteria for a
separate language. The term "dialect" used in this way [7][8] carries a political
connotation: it is mostly applied to low-prestige languages (regardless of their actual
degree of distance from the national language), to languages lacking institutional
support, or to those perceived as unsuitable for writing. The designation "dialect" is
also used popularly to refer to the unwritten or non-codified languages of developing
countries or isolated areas, where linguists would prefer the term "vernacular
language" [10].
2. Proposed Methodology
Research on regional dialects is scarce in Indian regional language identification. The
main objective of our research is therefore to identify an Indian regional dialect, and
with this objective in mind we consider the dialects of the Telugu language [1]. The four
regional dialects of Telugu are:

     1. Rayalaseema dialect: spoken across 4 districts, with a population of nearly
         1.52 crore as per the 2011 survey.
     2. Telangana dialect: spoken across 31 districts, with a population of nearly
         3.52 crore as per the 2011 survey.
     3. Kostha Andhra dialect: spoken in the Krishna, Guntur and Godavari districts.
     4. Kalinga Andhra dialect: spoken in major districts such as Visakhapatnam,
         Vizianagaram and Srikakulam.

Because Kostha Andhra and Kalinga Andhra are closely related, we treat the two as a
single dialect. Accounts of how the dialects arose differ, and several theories exist.
The reasons suggested include the language of the rulers of a region, the caste system of
that region and the settlement of people in that region.
 In the case of the Telangana dialect, there is a significant influence of Tamil and
Kannada, along with a proportionate amount of Urdu, which has made the dialect more
popular. For Rayalaseema, historical and geographical factors have to be considered,
together with the significant influence of Tamil and Kannada.
In the case of Kostha Andhra there is a significant influence of Sanskrit and English.
With these factors in mind, we classified the major dialects by their local popularity in
the Telugu-speaking states. Telangana is the 12th most populous state in India and the
11th largest by area. It has a rich folk music culture and lies in the south-central
stretch of the Indian peninsula. Its dialect is an informal spoken variety of Telugu that
is widely known and is used in many films. According to the survey conducted on 1 May
2011, the dialects have approximately 35.2 million speakers (Telangana), 34.5 million
speakers (Kostha Andhra) and 15.2 million speakers (Rayalaseema). Until now no
substantial dialect database has existed for Telugu, and this unavailability is a major
reason for the limited research. For this purpose, we built three dialect databases in
this paper: the Andhra dialect (Coastal Andhra slang + Mid Andhra Pradesh slang), the
Rayalaseema dialect and the Telangana dialect. They are essential for identifying the
regional dialect of words and sentences.

2.1. Anu-Script Manager
 Anu-Script Manager is a well-known software package for typing in more than three
languages. It gives the typed text a base to which pattern recognition can be applied,
and it played a pivotal role in our practical testing. It supports languages such as
Hindi, Devanagari, Telugu, Tamil, Kannada and Malayalam. Supported applications include
MS-Word, Photoshop, PageMaker, Corel and others (nearly all Windows applications).
Pattern recognition is the automated recognition of patterns and regularities in data.
It relates to neighbouring terms as follows: machine learning is one approach to pattern
recognition, while other approaches include hand-crafted rules or heuristics; and pattern
recognition is one approach to artificial intelligence, while other approaches include
symbolic artificial intelligence. Fig. 1 shows the process of using Anu-Script.

2.2. Linear Discriminant Analysis
 Linear Discriminant Analysis (LDA) is similar to PCA, but it focuses on maximizing the
separability among the known categories. As an example, suppose we want to reduce a 2D
map to a 1D map while maximizing the separability of two categories. LDA uses the
information from both axes to create a new axis and projects the data onto it in a way
that maximizes the separation of the two categories.
Two criteria are considered simultaneously. The first is to maximize the distance between
the two category means. The second is to minimize the variation within each category,
which LDA calls scatter. Together they give a ratio to be maximized: the squared
difference between the two means over the sum of the two scatters,
(m1 - m2)^2 / (s1^2 + s2^2).
Training and testing are the basic steps of every recognition system, and the methodology
follows them. In the training part the system is developed, whereas in the testing part
we evaluate the system created during the training phase. In the rest of this paper,
Section 2.3 describes database creation, Section 2.4 describes database digitalization
and Section 2.5 describes the methodology.
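The two LDA criteria can be made concrete with a small two-class example. The sketch below is an illustration in plain Python, not the authors' implementation: the toy 2D points and the function names are assumptions, and the projection axis is taken as the standard Fisher direction, proportional to Sw^-1 (m_a - m_b), where Sw is the summed within-class scatter matrix.

```python
# Two-class Fisher LDA in 2D using only the standard library.
# The points below are made-up toy data, not the paper's dataset.

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, m):
    """Entries (sxx, sxy, syy) of the scatter matrix of points around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(a, b):
    """Return w proportional to Sw^-1 (m_a - m_b), the axis that maximizes the ratio."""
    ma, mb = mean(a), mean(b)
    sa, sb = scatter(a, ma), scatter(b, mb)
    sxx, sxy, syy = (sa[i] + sb[i] for i in range(3))
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Inverse of the symmetric 2x2 matrix Sw applied to (dx, dy).
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

def fisher_ratio(a, b, w):
    """(m1 - m2)^2 / (s1^2 + s2^2) for the 1D projections of a and b onto w."""
    pa = [p[0] * w[0] + p[1] * w[1] for p in a]
    pb = [p[0] * w[0] + p[1] * w[1] for p in b]
    m1, m2 = sum(pa) / len(pa), sum(pb) / len(pb)
    s1 = sum((x - m1) ** 2 for x in pa)
    s2 = sum((x - m2) ** 2 for x in pb)
    return (m1 - m2) ** 2 / (s1 + s2)

class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]

w = fisher_direction(class_a, class_b)
j_lda = fisher_ratio(class_a, class_b, w)
j_x_axis = fisher_ratio(class_a, class_b, (1.0, 0.0))
j_y_axis = fisher_ratio(class_a, class_b, (0.0, 1.0))
print(j_lda, j_x_axis, j_y_axis)
```

Projecting onto the LDA axis gives a ratio at least as large as projecting onto either original axis, which is the sense in which LDA uses both axes to build a better one.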
2.3. Database Creation
An effective identification system needs datasets, and word-based classification
requires a large amount of data for the proposed method. Because the dialect vocabulary
differs significantly from region to region, we divided the datasets by region and
gathered word sets from as many speakers familiar with each region as we could find. We
created three dialect databases in this way: the Andhra dialect database, the Telangana
dialect database and the Rayalaseema dialect database. For the Andhra database we
approached nearly 50-60 people and asked them to say commonly used words in their native
slang. We filtered the words by cross-verifying them, removing words repeated across the
sets collected from each person, and combined the filtered sets into a single database.
For the Telangana database we approached nearly 80-90 people, and for the Rayalaseema
database nearly 20-30 people, and followed the same procedure. Fig. 1 shows the complete
process applied while creating the database, and Table 1 shows the complete database.

         Dialect            No. of sets gathered              Total number of words

     Andhra Pradesh                   56                                   690
        Telangana                     86                                  1578
       Rayalaseema                    25                                   270

                               Table 1: Complete dataset
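The filtering step of Section 2.3 (combining the word sets from individual speakers and removing repeated words) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the placeholder tokens are hypothetical, and exact string matching after trimming whitespace is an assumed notion of "repeated". The last lines add up the word counts of Table 1.

```python
def build_dialect_database(word_sets):
    """Merge per-speaker word sets into one list, keeping each word once."""
    database = []
    seen = set()
    for words in word_sets:
        for word in words:
            cleaned = word.strip()
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                database.append(cleaned)
    return database

# Hypothetical word sets from three speakers (placeholder tokens, not real data).
speaker_sets = [
    ["word_a", "word_b", "word_c"],
    ["word_b", "word_d"],
    ["word_a", "word_e", " word_d "],
]
database = build_dialect_database(speaker_sets)
print(database)

# Word counts as listed in Table 1.
table1_words = {"Andhra": 690, "Telangana": 1578, "Rayalaseema": 270}
total_words = sum(table1_words.values())
print(total_words)
```

The three counts in Table 1 sum to 2,538 words.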

2.4. Database Digitalization
We built three dialect databases (Andhra, Rayalaseema and Telangana) by gathering
multiple word sets from different people of the same region. The problem is that text
produced with an online text converter has no base value, so it cannot be used in the
identification process. We therefore used the Anu-Script Manager software, which is
widely used for regional-language typing in common applications and which provides a
formula base that is useful for identification. Using it, we digitalized the databases
obtained by filtering the individual datasets, 2,538 words in total, to carry out the
work.
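The paper does not specify the numeric form that Anu-Script Manager gives to a word. Purely as a stand-in, the sketch below turns a word into a fixed-length numeric vector using Unicode code points, which is one simple way of giving text a "base value" that a pattern recognition algorithm can consume. The function name, the vector length and the padding value are all assumptions made for illustration.

```python
def encode_word(word, length=8, pad=0):
    """Map a word to a fixed-length list of Unicode code points.

    This is an assumed encoding for illustration only; it is not the
    encoding produced by Anu-Script Manager.
    """
    codes = [ord(ch) for ch in word[:length]]
    return codes + [pad] * (length - len(codes))

# The word "Telugu" written in Telugu script, given as escapes for clarity.
telugu = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"
vector = encode_word(telugu)
print(vector)
```

Words longer than the chosen length are truncated and shorter words are padded, so every word yields a vector of the same size.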
2.5. Methodology
(a) Training:
         Since no region-specific dialect database existed until now, we had to create a
large one. Because the dialect vocabulary differs significantly from region to region, we
divided the datasets by region and gathered word sets from as many speakers familiar with
each region as we could find. We thus created three dialect databases: Andhra, Telangana
and Rayalaseema. On average we talked to approximately 10-15 people and stored the words
from each person as a set. We then filtered the sets, grouped them by region and
digitalized them into the database. After completing the database, we moved on to
sentence pattern recognition, for which we took a considerable number of files containing
sentences in the different slangs.

                                      Fig.2 Training Phase

(b) Testing:
During the testing phase we used Python to implement the Linear Discriminant Analysis
pattern recognition algorithm. We built code that takes either a single raw input or a
series of raw inputs from the user and checks the database for similar entries. For
example, when an input is given, the system searches the database and, if a similar value
exists, returns the matching value or values. After taking multiple inputs, it classifies
each raw input against the datasets and writes the results in a specific format: for a
series of inputs, the inputs are divided by region. In this way the code carries out
linear discriminant pattern recognition. For sentences, we first preprocess the sentence
and split it into words, and the rest of the execution is the same.
We used Python on Excel sheets to create and operate on the data for the implementation
of pattern recognition.

                                   Fig.3. Testing process
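The lookup part of the testing phase (check each input word against the dialect databases, and split a sentence into words first) can be sketched as below. This covers only the database lookup, not LDA itself. The function names, the placeholder vocabulary and the majority vote used to pick one dialect for a sentence are assumptions; the paper does not state how word-level matches are combined.

```python
from collections import Counter

def classify_word(word, databases):
    """Return the dialects whose database contains the word."""
    return [dialect for dialect, words in databases.items() if word in words]

def classify_sentence(sentence, databases):
    """Split a sentence into words and vote over the dialects they match."""
    votes = Counter()
    for word in sentence.split():
        votes.update(classify_word(word, databases))
    if not votes:
        return None  # no word of the sentence was found in any database
    return votes.most_common(1)[0][0]

# Placeholder vocabularies (hypothetical tokens, not real dialect words).
databases = {
    "Andhra": {"a1", "a2", "a3"},
    "Telangana": {"t1", "t2", "t3"},
    "Rayalaseema": {"r1", "r2"},
}

word_result = classify_word("r2", databases)
sentence_result = classify_sentence("t1 a1 t3 xx", databases)
unknown_result = classify_sentence("xx yy", databases)
print(word_result, sentence_result, unknown_result)
```

With the placeholder data, the sentence has two Telangana matches and one Andhra match, so the vote returns Telangana; a sentence with no known words returns `None`.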
3. Results
S.No.   Dialect        No. of words   No. of sentences in testing
1       Telangana      1578           18
2       Rayalaseema    270            16
3       Andhra         690            17

            Table 2: Statistical description of the database used for dialect identification
Method   Dialect        No. of testing   No. of sentences        Accuracy
                        sentences        correctly identified

LDA      Telangana      18               16                      88.89%
LDA      Rayalaseema    16               13                      81.25%
LDA      Andhra         17               14                      82.35%

                         Table 3: Accuracy of sentence-level dialect identification
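The accuracy column follows directly from the sentence counts: accuracy is the number of correctly identified sentences divided by the number of testing sentences. The short check below recomputes it from the counts in Table 3. The overall figure is derived here from the same counts and is not reported in the paper.

```python
# (correctly identified, testing sentences) per dialect, from Table 3.
results = {
    "Telangana": (16, 18),
    "Rayalaseema": (13, 16),
    "Andhra": (14, 17),
}

accuracy = {
    dialect: round(100 * correct / total, 2)
    for dialect, (correct, total) in results.items()
}

total_correct = sum(correct for correct, _ in results.values())
total_tested = sum(total for _, total in results.values())
overall = round(100 * total_correct / total_tested, 2)

print(accuracy, overall)
```

Rounded to two decimals, the Telangana figure is 88.89%, and the pooled accuracy over all 51 test sentences is 43/51.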

                         Fig.4 Testing samples and identified sentences

                                  Fig.5 Accuracy representation

  4. Conclusion & Future Scope
  In this work we performed sentence-based dialect identification of the Telugu language
  using the LDA model. We used Anu-Script Manager to digitalize the dataset. The main
  reason for the limited research on dialect identification is the lack of a database. We
  therefore created a dataset collected from different places belonging to the Telangana,
  Rayalaseema and Kostha Andhra regions. We trained the LDA method and applied testing to
  identify which slang a sentence belongs to. LDA provides good accuracy: 88.89% for
  Telangana, 81.25% for Rayalaseema and 82.35% for Andhra. In future, one can increase
  the database size and apply different machine learning models to identify the dialects.

References
[1].    Shivaprasad Satla, M. Sadanandam, “Identification of regional dialects of Telugu language using text
        independent speech processing models”, International Journal of Speech Technology, Feb 2020.
        DOI: 10.1007/s10772-020-09678-y
[2].    Al-Walaie, M. A., & Khan, M. B. (2017). Arabic dialects classification using text mining
        techniques. In International conference on computer and applications (ICCA).
[3].    Bailey, C. N. (1968). Is there a midland dialect? Washington, D.C.: ERIC Clearinghouse.
[4].    Balleda, J., Murthy, H. A., & Nagarajan, T. (2000). Language identification from short segments of
        speech. In Proceedings of the INTERSPEECH (pp. 1033–1036).
[5].    Chen, M., Wang, L., & Xu, C.-Z. (2017). A novel approach of system design for dialect speech
        interaction with NAO robot. In 18th international conference on advanced robotics (ICAR).
[6].    Chittaragi, N. B., & Koolagudi, S. G. (2017). Acoustic features based word level dialect
        classification using SVM and ensemble methods. In IC3, Noida, 10–12 August 2017.
[7].    Xin Guo, Yang Xiang, Qian Chen, “LDA-based online topic detection using tensor factorization”,
        Journal of Information Science, March 8, 2013. https://doi.org/10.1177/0165551512473066
[8].    Aytug Onan, Hasan Bulut, Serdar Korukoglu, “An improved ant algorithm with LDA-based
        representation for text document clustering”, Journal of Information Science, March 22, 2016.
        https://doi.org/10.1177/0165551516638784
[9].    P. Anupriya, S. Karpagavalli, “LDA based topic modeling”, 2015 International Conference on
        Advanced Computing and Communication Systems, 5-7 Jan. 2015.
[10].   Ayoub Bagheri, Mohamad Saraee, Franciska de Jong, “ADM-LDA: An aspect detection model
        based on topic modelling using the structure of review sentences”, June 11, 2014.
        https://doi.org/10.1177/0165551514538744
[11].   Liu, B, Zhang, L. A survey of opinion mining and sentiment analysis. In: Mining text data, 1st
        edn. New York: Springer, 2012, pp. 415–463.
[12].   Ku, LW, Chen, HH. Mining opinions from the Web: beyond relevance retrieval. Journal of the
        American Society for Information Science and Technology2007; 58(12): 1838–1850.
[13].   Me. Jakeera Begum and M.Venkata Rao, (2015), “Collaborative Tagging Using CAPTCHA”
        International Journal of Innovative Technology And Research, Volume No.3, Issue No.5,pp,2436 –
        2439.
[14].   L.Jagajeevan Rao, M. Venkata Rao, T.Vijaya Saradhi (2016), “How The Smartcard Makes the
        Certification Verification Easy” Journal of Theoretical and Applied Information Technology,
        Vol.83. No.2, pp. 180-186.
[15].   Venkata Rao Maddumala, R. Arunkumar, and S. Arivalagan (2018)“An Empirical Review on Data
        Feature Selection and Big Data Clustering” Asian Journal of Computer Science and Technology
        Vol.7 No.S1, pp. 96-100.
[16].   Singamaneni Kranthi Kumar, Pallela Dileep Kumar Reddy, Gajula Ramesh, Venkata Rao
        Maddumala, (2019), “Image Transformation Technique Using Steganography Methods Using LWT
        Technique” ,Traitement du Signal, vol 36, No 3, pp. 233-237.
[17].   Banavathu Mounika, Sk. Reshmi Khadherbhi, Venkata Rao Maddumala, R.S.M Lakshmi (2020 ),
        “Data Distribution Method With Text Extraction From Big Data”, Journal of Critical Reviews, Vol
        7, Issue 6, 2020, pp. 376-380.
[18].   V. Lakshman Narayana, B. Naga Sudheer, Venkata Rao Maddumala, P.Anusha, (2020), “Fuzzy
        Base Artificial Neural Network Model For Text Extraction From Images”, Journal of Critical
        Reviews, Vol 7, Issue 6, 2020, pp. 350-354.
[19].   R.S.M. Lakshmi Patibandla, B. Tarakeswara Rao, P. Sandhya Krishna, Venkata Rao Maddumala,
        “Medical Data Clustering Using Particle Swarm Optimization Method”, Journal of Critical
        Reviews, Vol 7, Issue 6, 2020, pp. 363-367.
[20].   J. Tejaswini, T. Mohana Kavya , R. Devi Naga Ramya , P. Sai Triveni, Venkata Rao
        Maddumala,(2020), “Accurate Loan Approval Prediction Based On Machine Learning Approach”,
        Journal of Engineering Science, Vol 11, Issue 4 , April/ 2020, pp. 523-532.
[21].   K. Yamini, K. Sai Swetha , P. Lakshmi Prasanna, M. Rupa Venkata Swathi , Venkata Rao
        Maddumala(2020), Image Colorization With Deep Convolutional Open CV, Journal of Engineering
        Science Vol 11, Issue 4 , April/ 2020, pp. 533-543.

[22].   Arepalli, Gopi & Erukula, Suresh & Gopi, A.P. & Nagaraju, Chiluka. (2016). Secure multicast
        routing protocol in MANETs using efficient ECGDH algorithm. International Journal of Electrical
        and Computer Engineering (IJECE). 6. 1857-1865. 10.11591/ijece.v6i4.9941.
[23].   K. Sarada, V. Lakshman Narayana,(2020), “Improving Relevant Text Extraction Accuracy using
        Clustering Methods”, TEST Engineering and Management, Volume 83, Page Number: 15212 –
        15219.
[24].   K. Sarada, V. Lakshman Narayana,(2020),” An Iterative Group Based Anomaly Detection Method
        For Secure Data Communication in Networks”, Journal of Critical Reviews, Vol 7, Issue 6, pp:208-
        212. doi: 10.31838/jcr.07.06.39.
[25].   Banavathu Mounika, P. Anusha, V. Lakshman Narayana,(2020), “ Use of BlockChain Technology
        In Providing Security During Data Sharing”, Journal of Critical Reviews, Vol 7, Issue 6, pp:338-
        343. doi: 10.31838/jcr.07.06.59.
[26].   V. Lakshman Narayana, B. Naga Sudheer,(2020),” Fuzzy Base Artificial Neural Network Model
        For Text Extraction From Images”, Journal of Critical Reviews, Vol 7, Issue 6,pp:350-354, doi:
        10.31838/jcr.07.06.61.
[27].   V. Lakshman Narayana, A. Peda Gopi,(2020),” Accurate Identification And Detection Of Outliers
        In Networks Using Group Random Forest Methodoly”, Journal of Critical Reviews, Vol 7, Issue
        6,pp:381-384, doi: 10.31838/jcr.07.06.67.
[28].   Sandhya Pasala, V. Pavani, G. Vidya Lakshmi, V. Lakshman Narayana,(2020),” Identification Of
        Attackers Using Blockchain Transactions Using Cryptography Methods”, Journal of Critical
        Reviews, Vol 7, Issue 6,pp:368-375, doi: 10.31838/jcr.07.06.65
[29].   C.R.Bharathi, Vejendla. Lakshman Narayana , L.V. Ramesh, (2020),” Secure Data Communication
        Using Internet of Things”, International Journal of Scientific & Technology Research, Volume 9,
        Issue 04,pp:3516-3520.
