Resolving Pronouns for a Resource-Poor Language, Malayalam Using Resource-Rich Language, Tamil

Sobha Lalitha Devi
AU-KBC Research Centre, Anna University, Chennai
sobha@au-kbc.org

Abstract

In this paper we describe in detail how a resource-rich language can be used for resolving pronouns in a resource-poor language. The source language, which is the resource-rich language in this study, is Tamil, and the resource-poor language is Malayalam; both belong to the same language family, Dravidian. The pronominal resolution developed for Tamil uses CRFs. Our approach is to leverage the Tamil language model to test Malayalam data, and the processing required for the Malayalam data is detailed. The similarity between the languages at the syntactic level is exploited in identifying the features for developing the Tamil language model. The word form, or lexical item, is not considered as a feature for training the CRFs. Evaluation on Malayalam Wikipedia data shows that our approach is sound and that the results, though not as good as for Tamil, are comparable.

1   Introduction

Present-day natural language processing techniques require large amounts of manually annotated data to work. In reality, the required quantity of data is available only for a few languages of major interest. In this work we show how a resource-rich language, Tamil, can be leveraged to resolve anaphora for a related resource-poor language, Malayalam. Both Tamil and Malayalam belong to the same language family, Dravidian. The method we focus on exploits the similarity between the languages at the syntactic level, and anaphora resolution heavily depends on syntactic features. If the resources available in one language (henceforth referred to as the source) can be used to facilitate resolution tasks such as anaphora for all the languages related to the language in question (the target), the problem of unavailability of resources would be alleviated.

There is a recent research paradigm in which researchers work on algorithms that can rapidly develop machine translation and other tools for an obscure language. This work falls into that paradigm, under the assumption that the language in question has a less obscure sibling. Moreover, the problem is intellectually interesting. While there has been significant research in using resources from another language to build, for example, parsers, there has been very little work on utilizing the close relationship between languages to produce high-quality tools such as anaphora resolution (Nakov and Ng, 2012). In our work we are interested in the following questions. If two languages are from the same language family and are similar in syntactic structure but not in lexical form and script:

    1. Can the language model developed for one language be used for analyzing the other language?
    2. How can the difference in lexical form be resolved when using the language model?
    3. How can the challenges of script variation be overcome?

In this work we develop a language model for resolving pronominals in Tamil using CRFs and use this language model to test another language, Malayalam. As said earlier, in this work we are

                                                       611
                   Proceedings of Recent Advances in Natural Language Processing, pages 611–618,
                                             Varna, Bulgaria, Sep 2–4, 2019.
                                   https://doi.org/10.26615/978-954-452-056-4_072
focusing on the Dravidian family of languages. Historically, all Dravidian languages have branched out from a common root, Proto-Dravidian. Among the Dravidian languages, Tamil is the oldest. Though there is similarity at the syntactic level, there is no similarity in lexical form or script among the languages. We are motivated by the observation that related languages tend to have similar word order and syntax, but not similar script or orthography; hence the words are not similar.

Tamil has the most resources at all levels of linguistic analysis, from morphological analyser to discourse parser, while Malayalam has the least. The similarity of the two languages is described in detail in Section 2.

The remainder of this article is organized as follows: Section 2 provides a detailed description of the two languages and their linguistic similarities and differences; Section 3 presents pronominal resolution in Tamil; in Section 4 we introduce our proposed approach on how the language model for Tamil can be used for resolving Malayalam pronouns; Section 5 describes the datasets used for evaluation, the experiments and the analysis of the results; and the paper ends with the conclusion in Section 6.

2   How Similar Are Tamil and Malayalam?

As mentioned earlier, both languages belong to the Dravidian family and are relatively free word order languages, inflectional (rich in morphology), agglutinative and verb-final. They have nominative and dative subjects. The pronouns have inherent gender marking, as in English, and have the same lexical forms, "avan" "he", "aval" "she" and "atu" "it", in both Tamil and Malayalam. Though the pronouns have the same lexical form and meaning, it can be said that there is no lexical similarity between the two languages. The similarity between two languages can be at three levels: a) the writing script, b) the word forms and c) the syntactic structure.

Script Level: The two languages have different writing systems, though both derive from the Grantha script. Hence there is no similarity at the script level.

The Word Level: There is no similarity at the lexical level between the two languages. The Sanskritization of Malayalam contributed more Sanskrit verbs to Malayalam, whereas Tamil retained the Proto-Dravidian verbs. For example, the word for "talk" in Malayalam is "samsarikkuka" "to talk", which has its root in Sanskrit, whereas the Tamil equivalent is "pesuu" "to talk", with its root in Proto-Dravidian.

The Syntactic Structure Level: There is a great deal of similarity at the syntactic structure level between the two languages. Since the antecedent of an anaphor depends on the position of the noun, the structural similarity is a positive feature for our goal. The syntactic similarity at the sentence level, case marker level and pronominal distribution level is explained with examples.

Case marker level: Both languages have the same number of cases and their distribution is similar. In both languages, nouns inflected with nominative or dative case become the subject of the sentence (the dative subject is a peculiarity of Indian languages). The accusative case denotes the object.

Clausal sentences: The clause constructions in both languages follow the same rule. The clauses are formed by nonfinite verbs. The clauses do not have free word order; they have fixed positions. The order of embedding of the subordinate clause is the same in both languages.

Ex:1
(Ma) [innale vanna(vbp) kutti]/Sub-RP-cl {sita annu}/Main-cl
(Ta) [neRu vanta(vbp) pon]/Sub-RP-cl {sita aakum}/Main-cl
     [Yesterday came girl]/Sub-cl {Sita is}/Main-cl
     (The girl who came yesterday is Sita.)

As can be seen from the above example, the basic syntactic structure is the same in both languages. The example is a two-clause sentence with a relative participial clause and a main clause, the relative participial clause being formed by the nonfinite verb (vbp). Using the same example we can observe the pronominal distribution.

Ex:2
(Ma) [innale vanna(vbp) avali (PRP)]/Sub-RP-cl {sitai annu}/Main-cl
(Ta) [neRu vanta(vbp) avali (PRP)]/Sub-RP-cl {sitai aakum}/Main-cl
     [Yesterday came shei (PRP)]/Sub-cl {Sitai is}/Main-cl
     (The she who came yesterday is Sita.)

In the above example the pronoun "aval" "she" occurs at the same position in both languages

and the antecedent "sita" also occurs at the same position, as shown by the co-indexing. Consider another example.

Ex:3
(Ma) sithaai kadaikku pooyi. avali pazham Vaangicchu(V,past)
(Ta) sithaai kadaikku cenRaal. avali pazham Vaangkinaal(V,past,+F,+Sg)
     Sita shop went. She fruit bought
     (Sitai went to the shop. Shei bought fruit.)

In the above example there are two sentences; the pronoun is in one sentence and the antecedent in the other. Here one can see the distribution of the pronoun "aval" and where the antecedent "sita" occurs. Tamil has number, gender and person agreement between subject and verb while Malayalam does not, so this cannot be used as a grammatical feature for identifying the antecedent of an anaphor. This grammatical variation, however, does not affect the identification of pronoun-antecedent relations. From the above examples we can see that the two languages have the same syntactic structure at the clause and sentence level. We exploit this similarity between the two languages to achieve our goal: using it, the language model of Tamil can be used to resolve pronouns in Malayalam.

3   Pronoun Resolution in Tamil

3.1 Pronouns in Tamil

In this section, we analyse the pronominal expressions in Tamil in detail. Pronouns are words used as substitutes for nouns that have already been mentioned or are already known. There are also pronouns which do not refer. Pronouns in Tamil have person (1st, 2nd, 3rd) and number (singular, plural) distinctions. Masculine, feminine and neuter gender distinctions are clearly marked in 3rd person pronouns, whereas 1st and 2nd person pronouns make no gender distinction. In this work we consider only third person pronouns. Third person pronouns in Tamil have inherent gender, as in English: "avan" he, "aval" she and "atu" it. The distribution of pronouns in various syntactic constructions is explained with examples below.

Ex:4
4a. maaNavarkaLi paLLikku celkiranar.
    Students(N) school(N)+dat go(V)+present+3pl
    (Students are going to the school.)
4b. avarkaLi veekamaaka natakkinranar.
    They(PN) fast(ADV) walk(V)+present+3pl
    (They are walking fast.)

Considering Ex.4a and Ex.4b, sentence Ex.4b has the 3rd person plural pronoun 'avarkaL' as its subject, and it refers to the plural noun 'maaNavarkaL', the subject of Ex.4a.

Ex:5
5a. raamuvumi giitavumj nanparkaL.
    Raamu(N)+INC Gita(N)+INC friends(N)
    (Ramu and Gita are friends.)
5b. avani ettaam vakuppil padikkiraan.
    He(PN) eight(N) class(N) study(V)+present+3sm
    (He studies in the eighth standard.)
5c. avaLumj ettaam vakuppil padikkiraaL.
    She(PN) eight(N) class(N) study(V)+present+3sf
    (She also studies in the eighth standard.)

In Ex.5b, the 3rd person masculine pronoun 'avan' occurs as the subject and refers to the masculine noun 'raamu', the subject noun in Ex.5a. Similarly, the 3rd person feminine pronoun 'avaL' in Ex.5c refers to the feminine noun 'giita' in Ex.5a. 'atu', which is a 3rd person neuter pronoun, can also occur with a genitive/possessive case marker.

3.1.1 Non-anaphoric Pronouns

Pronouns can also occur as generic mentions without a referent. In English this is known as 'pleonastic it'.

Ex:6
    atu oru malaikalam.
    It(PN) one(Qc) rainy_season(N)
    (It was a rainy season.)

In Ex.6, the 3rd person neuter pronoun 'atu' (it) does not have a referent. Here 'atu' is equivalent to the pleonastic 'it' in English.

3.1.2 Corpus Annotation

We collected 600 news articles from various online Tamil news wires. The articles are from the Sports, Disaster and General News domains. The anaphoric expressions are annotated along with their antecedents using PAlinkA, a highly customisable graphical tool for discourse annotation (Orasan, 2003), which we customized for Tamil. We used two tags, namely MARKABLE and COREF. The corpus used for training is 54,563 words, annotated for anaphora-antecedent pairs, and the testing

corpus is 10,912 words. We calculated the inter-annotator agreement, the degree of agreement among annotators, using Cohen's kappa as the agreement statistic. The kappa coefficient is generally regarded as the statistic of choice for measuring agreement on ratings made on a nominal scale. We obtained a kappa score of 0.87. The differences between the annotators were analysed, and the variation was found to occur in the marking of antecedents for pronominals; this is common in sentences with clausal inversion and genitive drop.

3.2 Pronoun Resolution System

Early works in anaphora resolution by Hobbs (1978), Carbonell and Brown (1988), Rich and LuperFoy (1988) etc. were knowledge-intensive approaches, in which syntactic and semantic information, world knowledge and case frames were used. Centering theory, a discourse-based approach to anaphora resolution, was presented by Grosz (1977) and Joshi and Kuhn (1979). Salience-feature-based approaches were presented by Lappin and Leass (1994), Kennedy and Boguraev (1996) and Sobha et al. (2000). Indicator-based, knowledge-poor methods for anaphora resolution were presented by Mitkov (1997, 1998). One of the early works using machine learning was Dagan and Itai's (1990) unsupervised approach based on word co-occurrence. With the adoption of machine learning techniques, researchers worked on pronominal anaphora resolution and noun phrase anaphora resolution simultaneously. Other machine learning approaches to anaphora resolution were the following: Aone and Bennett (1995), McCarthy and Lehnert (1995), Soon et al. (2001) and Ng and Cardie (2002) used decision-tree-based classifiers. Anaphora resolution using CRFs was presented by McCallum and Wellner (2003) for English, Li et al. (2008) for Chinese and Sobha et al. (2011, 2013) for English and Tamil.

For Indian languages, anaphora resolution engines have been demonstrated for only a few languages, such as Hindi, Bengali, Tamil and Malayalam. Most Indian languages do not have parsers and other sophisticated pre-processing tools. The earliest work on Indian languages, 'Vasisth', was a rule-based multilingual anaphora resolution platform by Sobha and Patnaik (1998, 2000, 2002), in which the morphological richness of Malayalam and Hindi was exploited without using a full parser; case marker information is used for identifying the subject, object, and direct and indirect objects. Prasad and Strube (2000), Uppalapu et al. (2009) and Dakwale et al. (2013) presented different approaches using Centering theory for Hindi. Sobha et al. (2007) presented a salience-factor-based approach with limited shallow parsing of the text. Akilandeswari et al. (2013) used CRFs for the resolution of third person pronouns. Ram et al. (2013) used Tree CRFs for anaphora resolution in Tamil with features from dependency-parsed text. Most published works considered the resolution of third person pronouns, and it is a non-trivial task.

The pronoun resolution engine identifies the antecedents of the pronouns. It is built using the Conditional Random Fields (CRFs) technique. Though CRFs are best known for sequence labelling tasks, we use the technique to classify the correct anaphor-antecedent pair among the possible candidate NP pairs, by presenting the features of each NP pair and avoiding the transition probability. During training we form positive pairs by pairing an anaphoric pronoun with its correct antecedent NP, and negative pairs by pairing anaphoric pronouns with the other NPs which match the pronoun in person, number and gender (PNG). These positive and negative pairs are fed to the CRFs engine and the language model is generated. During testing, when an anaphoric pronoun occurs in a sentence, the noun phrases which match the pronoun in PNG and occur in the preceding portion of the sentence or in the four preceding sentences are collected, paired with the anaphoric pronoun, and presented to the CRFs engine to identify the correct anaphor-antecedent pair.

3.2.1 Pre-processing

The input document is processed with a sentence splitter and tokeniser to split the document into sentences and the sentences into individual tokens, which include words, punctuation markers and symbols. The sentence-split and tokenized documents are then processed with the syntactic processing modules: a morphological analyser, a part-of-speech tagger and a chunker. These modules are developed in house.

a) Morphological Analyser: Morphological analysis decomposes each word into its component morphemes and assigns the correct morpho-syntactic information. The Tamil morphological analyser is developed using a paradigm-based approach and implemented using Finite State Automata (FSA) (Vijay Sundar et al., 2010). The words are classified according to their suffix formation and marked as paradigms. The number of paradigms used in this system is: noun

paradigms: 32; verb paradigms: 37; adjective paradigms: 4; adverb paradigms: 1. A root word dictionary with 1,52,590 root words is used for developing the morphological analyser. The analyser was tested with 12,923 words; the system processed 12,596 of them, of which it correctly tagged 12,305. The precision is 97.69% and the recall is 97.46%. The analyser returns all possible parses for a given word.

b) Part-of-Speech (POS) Tagger: The part-of-speech tagger disambiguates the multiple parses given by the morphological analyser using the context in which a word occurs. It is developed using the machine learning technique Conditional Random Fields (CRF++) (Sobha et al., 2010). The features used for learning are a set of linguistic suffix features along with statistical suffixes, with a window of 3 words. We used 4,50,000 words tagged with BIS POS tags. The system performs with 100% recall and an average precision of 95.16%.

c) Noun and Verb Phrase Chunker: Chunking is the task of grouping grammatically related words into chunks such as noun phrases, verb phrases, adjectival phrases etc. The system is developed using the machine learning technique Conditional Random Fields (CRF++) (Sobha et al., 2010). The features used are the POS tag, the word and a window of 5 words, and the training corpus is 74,000 words. The recall is 100% and the average precision is 92.00%.

3.3 Pronoun Resolution Engine

In both the training and testing phases, the noun phrases (NPs) which match the PNG of the pronoun are considered, and the features are extracted from these NPs. In the training phase, the positive and negative pairs are marked and fed to the ML engine to generate a language model. In the testing phase, these NPs with their features are input to the language model to identify the antecedent of a pronoun. We have not taken the lexical item, i.e. the word itself, as a feature; we use only the grammatical tags. The features selected represent the syntactic position of the anaphor-antecedent occurrence.

3.3.1 Feature Selection

The features required for machine learning are identified from the shallow-parsed input sentences. The features for all possible candidate antecedent-pronoun pairs are obtained by pre-processing the input sentences with the morphological analyser, POS tagger and chunker. The features identified can be classified as positional and syntactic features.

Positional Features: The position of the candidate antecedent is noted: in the same sentence as the pronoun, or in one of the four sentences preceding the current sentence.

Syntactic Features: The syntactic arguments of the candidate noun phrases in the sentence are a key feature. The arguments of the noun phrases, such as subject, object and indirect object, are obtained from the case suffix affixed to the noun phrase. As mentioned in Section 2, the subject of a sentence can be identified by the case marker it takes; we use the morphological markings for this. The features are:

a) the POS tag and chunk tag of the candidate NP, and the case marker tags of the noun;
b) the suffixes attached to the verb which show gender.

3.3.2 Development of the Tamil Language Model

We used 600 Tamil newspaper articles for building the language model. The preparation of the training data is as follows. The raw corpus is processed with the sentence splitter and tokeniser, and the tokenized corpus is then preprocessed with the shallow processing modules, namely the morphological analyser, part-of-speech tagger and chunker. The training data is prepared from this processed corpus. For each pronoun, the noun phrase (NP) preceding the pronoun and the NPs in the preceding sentences up to the correct antecedent NP which match it in person, number and gender (PNG) are selected for training. The above features are used for CRF learning. The system was evaluated with data from the web and the result is given below.

    Domain      Testing Corpus (Words)   Precision (%)   Recall (%)
    News data   10,912                   86.2            66.67

    Table 1: Pronominal Resolution (CRFs engine)

4   Resolution of Pronouns in a Malayalam Corpus Using the Tamil Language Model
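Before turning to the cross-lingual experiments, it may help to make the training-pair construction of Sections 3.2-3.3 concrete. The following is a minimal, hypothetical sketch (not the authors' CRF++ pipeline; the class and function names and the PNG codes are illustrative) of how candidate anaphor-antecedent pairs with purely grammatical-tag features could be generated:

```python
# Hypothetical sketch of the pair construction in Sections 3.2-3.3: each
# pronoun is paired with every NP that precedes it within the current
# sentence or the four preceding sentences and agrees with it in person,
# number and gender (PNG).  The pair containing the true antecedent is
# labelled positive; all other agreeing pairs are negative.  Only
# grammatical tags are used as features -- never the word form itself.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Chunk:
    chunk_id: int   # position of the chunk in the document
    sent_id: int    # index of the sentence containing it
    pos: str        # POS tag of the head word
    case: str       # case suffix tag: "nom", "dat", "acc", ...
    png: str        # person-number-gender code, e.g. "3sm", "3sf", "3pl"

def training_pairs(pronoun: Chunk, antecedent_id: int,
                   nps: List[Chunk], window: int = 4) -> List[Tuple[Dict, int]]:
    """Build (feature-dict, label) pairs for one anaphoric pronoun."""
    pairs = []
    for np in nps:
        dist = pronoun.sent_id - np.sent_id
        # keep only NPs that precede the pronoun within the sentence window
        if not (0 <= dist <= window) or np.chunk_id >= pronoun.chunk_id:
            continue
        # candidate must agree with the pronoun in PNG
        if np.png != pronoun.png:
            continue
        features = {
            "cand_pos": np.pos,      # grammatical tags only:
            "cand_case": np.case,    # the lexical item is never a feature
            "sent_dist": dist,
            "png": pronoun.png,
        }
        pairs.append((features, 1 if np.chunk_id == antecedent_id else 0))
    return pairs
```

In the paper these pairs are presented to the CRFs engine (CRF++) with the transition probability suppressed; the binary labels above stand in for that classification step.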

In this section, we present in detail how Malayalam is tested using the Tamil language model. The Malayalam data is pre-processed to meet the requirements of the Tamil language model's test data. The test data requires four kinds of grammatical information: i) the POS, ii) the case marker, iii) the number, gender and person, and iv) the chunk information. In the introduction we asked three questions on how a language model for a source language can be used for testing a target language. The three questions are dealt with one by one below.

1. Can the language model developed for one language be used for analyzing the other language?

In this study we used the language model of the source language, Tamil. The features used to develop this language model are the POS tag, the chunk tag and the case/suffix tags. The word form was not considered as a feature. Since these are the features used for learning, the test data must also contain this information. As said earlier, Malayalam is not a resource-rich language, and it does not have high-accuracy pre-processing engines such as a POS tagger, chunker or morphological analyser. Hence we developed rudimentary preprocessing systems which can provide the POS tags, case/suffix tags and chunk tags.

The POS information: The POS and suffix information are assigned to the corpus using a root word dictionary with part-of-speech information together with a suffix dictionary. The root word dictionary, which includes all types of pronouns, contains nearly 66,000 root words. The grammatical information (POS), such as noun, verb, adjective and pronoun, and the person, number and gender (PNG) information are given for each word in the dictionary.

The suffix dictionary contains all possible suffixes which a root word can take, including the sandhi changes that occur when a suffix is added to the root word. The suffixes are of two types: i) those which attach to nouns, called case suffixes, and ii) those which attach to verbs, called TAM (Tense, Aspect and Modal) suffixes. Using this suffix dictionary we can identify the POS of a word even if the word is not present in the root word dictionary. The suffix dictionary has 1,00,000 unique entries.

The noun chunk information is given by a rule-based chunker which works on three linguistic rules; only the noun phrases are required for anaphora resolution. Chunks are identified using the basic linguistic rules for noun phrases given below:

1. [determiner] [quantifier] [intensifier] [classifier] [adjective] {Head Noun}, where the head noun is obligatory and all the others are optional;
2. an NN+NN combination;
3. NN with no suffix + NN with no suffix ... + NN with or without a suffix.

Using the above rules we identified the noun chunks in Malayalam. The pre-processing discussed above gives an accuracy of 66% for POS and suffix tagging and 63% for chunking.

2. How can the difference in lexical form be resolved when using the language model?

The second question concerns the lexical forms, or words, which are not similar in the two languages, and how this can be resolved. The analysis of Tamil has shown that the syntactic structure of the language has more prominence than the words in the resolution of anaphors. Hence we took the syntactic features and did not take the word as a feature; the system learned only the structural patterns.

3. How can the challenges of script variation be overcome?

Since the word is not taken as a feature, the script does not pose any challenge. Still, to have the same script we converted the two languages into one form, the WX notation. This gave the two languages the same representation.

4. Testing with Malayalam Data

We selected 300 articles from Malayalam Wikipedia, of different genres and sizes. The 300 Malayalam documents from Wikipedia contain 7,600 sentences with 3,660 3rd person pronouns. The distribution of the pronouns is presented in Table 2.

    Pronoun                                 Number of Occurrences
                                            with Inflected Forms
    avan (3rd person masculine singular)    1120
    aval (3rd person feminine singular)      840
    avar (3rd person honorific)              420
    athu (3rd person neuter singular)       1280
    Total                                   3660

    Table 2: Distribution of 3rd person pronouns in the corpus

As discussed in the earlier section, the pre-processing done for Tamil using the syntactic modules provides morphological, POS and chunking information. Hence the same pre-processing information is necessary for the Malayalam data as
                                                          formation is necessary for Malayalam data as

                                                        616
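The suffix-dictionary lookup described above can be sketched as follows. This is a minimal illustration and not the authors' implementation; the example suffixes and tags are hypothetical stand-ins for the 1,00,000-entry dictionary.

```python
# Minimal sketch of POS guessing via a suffix dictionary for an
# agglutinative language. Case suffixes imply a noun; TAM suffixes
# imply a verb. The entries below are hypothetical examples.
SUFFIX_DICT = {
    "kku": "NOUN",   # dative case suffix (attaches to nouns)
    "aal": "NOUN",   # instrumental case suffix
    "thu": "VERB",   # past-tense TAM suffix (attaches to verbs)
}

def guess_pos(word: str, root_dict: dict) -> str:
    """Return the POS of `word`, falling back to suffix matching
    when the word is absent from the root-word dictionary."""
    if word in root_dict:
        return root_dict[word]
    # Try longer suffixes first so the most specific match wins.
    for n in range(len(word) - 1, 0, -1):
        suffix = word[-n:]
        if suffix in SUFFIX_DICT:
            return SUFFIX_DICT[suffix]
    return "UNK"

# An out-of-vocabulary word ending in the dative suffix is tagged NOUN.
print(guess_pos("jolikkarkku", {}))  # NOUN
```

Matching the longest suffix first keeps a short suffix from shadowing a longer, more specific one that includes it.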
The documents are initially preprocessed with a sentence splitter and a tokenizer. The tokenized documents are then processed with the POS, suffix and chunking systems discussed above to enrich the text with syntactic information.
For each pronoun, we identify the possible candidate antecedents. The noun phrases that occur before the pronoun in the current sentence and in the preceding four sentences, and that match the pronoun in person, number and gender (PNG), are identified as possible candidates. For these candidates we extract the features required for the CRFs technique, as explained in the previous section. After the features for the selected candidate antecedents are extracted, the antecedent is identified using the language model built from the Tamil data. The results are encouraging, with 67% accuracy, which is a respectable score. The errors and the evaluation are discussed in detail in the next section.

5. Experiment and Discussion

The experiment showed that the resolution of the pronouns "avan" (he) and "aval" (she) is similar to that in the Tamil documents. The issue of split antecedents is not addressed, and hence pronouns referring to coordinated nouns were not resolved. The pronoun resolved least well, compared to the others, is the third person neuter pronoun "atu". The third person neuter pronoun usually has a larger number of possible candidates, which leads to poorer resolution. Consider the following example.

Ex:7
avan     joli     ceytha          jolikkarkku
He(PRP)  work(N)  do(V)+past+RP   worker(N)+pl+dat
vellam   kotuthu.
water(N) give(V)+past
(He gave water to the workers who did the work.)

atu      nallatu   aayirunnu
It(PRP)  good(N)   is(Copula V)+past
(It was good.)

In this example, the pronoun 'atu' refers to 'vellam' ('water') in the previous sentence. In the previous sentence there are two possible candidate antecedents, 'vellam' and 'joli', which are 3rd person nouns. The ML engine chooses 'joli', which is in the initial position of the sentence and in the subject position. When the antecedent is in the object position, the engine has not identified it properly. The following table gives the results for the pronouns.

    Type of pronoun                              Precision (%)   Recall (%)
    avan (3rd person masculine singular)         70.83           69.52
    avaL (3rd person feminine singular)          69.34           68.56
    avar/avarkal (3rd person plural/honorific)   65.45           70.34
    atu (3rd person neuter singular)             56.67           65.67
    Total                                        68.45           67.34
    Table 3: Evaluation Results

6. Conclusion

In this paper we explained a method that uses a high-resource language to resolve anaphors in a less-resourced language. In this experiment the high-resource language is Tamil and the less-resourced language is Malayalam. The results are encouraging. The model needs to be tested with more data as future work.
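The candidate-selection step described in Section 4, which keeps noun phrases that precede the pronoun within the current sentence plus the four preceding sentences and agree with it in PNG, can be sketched as follows. The data representation, helper names and PNG values are assumptions for illustration, not the authors' code.

```python
# Sketch of candidate-antecedent selection: collect noun phrases that
# precede the pronoun within a window of the current sentence plus the
# four preceding sentences, keeping those that agree in person, number
# and gender (PNG). The NP representation here is hypothetical.
from dataclasses import dataclass

@dataclass
class NP:
    text: str
    sent_idx: int   # index of the sentence containing the NP
    tok_idx: int    # token position within the sentence
    png: tuple      # (person, number, gender), e.g. ("3", "sg", "n")

def candidate_antecedents(pronoun: NP, nps: list, window: int = 4) -> list:
    """Return NPs preceding `pronoun` in the current sentence or the
    previous `window` sentences, with matching PNG features."""
    out = []
    for np in nps:
        in_window = pronoun.sent_idx - window <= np.sent_idx <= pronoun.sent_idx
        precedes = (np.sent_idx < pronoun.sent_idx
                    or (np.sent_idx == pronoun.sent_idx
                        and np.tok_idx < pronoun.tok_idx))
        if in_window and precedes and np.png == pronoun.png:
            out.append(np)
    return out

# As in Ex:7, both 'joli' and 'vellam' are 3rd person neuter singular,
# so both survive the PNG filter for the pronoun 'atu'; the CRF model
# must then rank them.
nps = [NP("joli", 0, 1, ("3", "sg", "n")), NP("vellam", 0, 4, ("3", "sg", "n"))]
atu = NP("atu", 1, 0, ("3", "sg", "n"))
print([c.text for c in candidate_antecedents(atu, nps)])  # ['joli', 'vellam']
```

The example also shows why "atu" is hard: PNG agreement alone cannot separate the two neuter candidates, which is left to the learned model.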
