Resolving Pronouns for a Resource-Poor Language, Malayalam Using Resource-Rich Language, Tamil

Sobha Lalitha Devi
AU-KBC Research Centre, Anna University, Chennai
sobha@au-kbc.org

Abstract

In this paper we describe in detail how a resource-rich language can be used for resolving pronouns in a resource-poor language. The source language, which is the resource-rich language in this study, is Tamil, and the resource-poor language is Malayalam; both belong to the same language family, Dravidian. The pronominal resolution developed for Tamil uses CRFs. Our approach is to leverage the Tamil language model to test Malayalam data, and the processing required for the Malayalam data is detailed. The similarity between the languages at the syntactic level is exploited in identifying the features for developing the Tamil language model. The word form, or lexical item, is not considered as a feature for training the CRFs. Evaluation on Malayalam Wikipedia data shows that our approach is sound and that the results, though not as good as for Tamil, are comparable.

1   Introduction

Present-day natural language processing techniques require large amounts of manually annotated data to work. In reality, the required quantity of data is available only for a few languages of major interest. In this work we show how a resource-rich language, Tamil, can be leveraged to resolve anaphora for a related resource-poor language, Malayalam. Both Tamil and Malayalam belong to the same language family, Dravidian. The method we focus on exploits the similarity between the languages at the syntactic level, and anaphora resolution heavily depends on syntactic features. If the resources available in one language (henceforth referred to as the source) can be used to facilitate resolution tasks such as anaphora for all the languages related to the language in question (the target), the problem of unavailability of resources would be alleviated.

There is a recent research paradigm in which researchers work on algorithms that can rapidly develop machine translation and other tools for an obscure language. This work falls into that paradigm, under the assumption that the language in question has a less obscure sibling. Moreover, the problem is intellectually interesting. While there has been significant research in using resources from another language to build, for example, parsers, there has been very little work on utilizing the close relationship between languages to produce high-quality tools such as anaphora resolution (Nakov and Ng, 2012). In our work we are interested in the following questions. If two languages are from the same language family and are similar in syntactic structure but not in lexical form and script:

    1. Can the language model developed for one language be used for analyzing the other language?
    2. How can the difference in lexical form be resolved when using the language model?
    3. How can the challenges of script variation be overcome?

In this work we develop a language model for resolving pronominals in Tamil using CRFs and use this language model to test another language, Malayalam. As said earlier, in this work we are

                                                       611
                   Proceedings of Recent Advances in Natural Language Processing, pages 611–618,
                                             Varna, Bulgaria, Sep 2–4, 2019.
                                   https://doi.org/10.26615/978-954-452-056-4_072
focusing on the Dravidian family of languages. Historically, all Dravidian languages have branched out from a common root, Proto-Dravidian. Among the Dravidian languages, Tamil is the oldest. Though there is similarity at the syntactic level, there is no similarity in lexical form or script among the languages. We are motivated by the observation that related languages tend to have similar word order and syntax, but not similar script or orthography; hence the words are not similar.

Tamil has the most resources at all levels of linguistic analysis, from morphological analyser to discourse parser, while Malayalam has the least. The similarity of the two languages is described in detail in Section 2.

The remainder of this article is organized as follows: Section 2 provides a detailed description of the two languages and their linguistic similarities and differences; Section 3 presents pronominal resolution in Tamil; in Section 4 we introduce our proposed approach on how the language model for Tamil can be used for resolving Malayalam pronouns; Section 5 describes the datasets used for evaluation, the experiments and the analysis of the results; and the paper ends with the conclusion in Section 6.

2   How Similar Are Tamil and Malayalam?

As mentioned earlier, both languages belong to the Dravidian family and are relatively free word order languages, inflectional (rich in morphology), agglutinative and verb-final. They have nominative and dative subjects. The pronouns have inherent gender marking, as in English, and have the same lexical forms, "avan" "he", "aval" "she" and "atu" "it", in both Tamil and Malayalam. Though the pronouns have the same lexical form and meaning, it can be said that there is no lexical similarity between the two languages. The similarity between two languages can be at three levels: a) the writing script, b) the word forms and c) the syntactic structure.

Script Level: The two languages have different writing systems, though both derive from the Grantha script. Hence there is no similarity at the script level.

The Word Level: There is no similarity at the lexical level between the two languages. The Sanskritization of Malayalam contributed more Sanskrit verbs to Malayalam, whereas Tamil retained the Proto-Dravidian verbs. For example, the word for "talk" in Malayalam is "samsarikkuka" "to talk", which has its root in Sanskrit, whereas the Tamil equivalent is "pesuu" "to talk", with its root in Proto-Dravidian.

The Syntactic Structure Level: There is a great deal of similarity at the syntactic structure level between the two languages. Since the antecedent of an anaphor depends on the position of the noun, the structural similarity is a positive feature for our goal. The syntactic similarity at the sentence level, case marker level and pronominal distribution level is explained with examples.

Case marker level: Both languages have the same number of cases and their distribution is similar. In both languages, nouns inflected with nominative or dative case become the subject of the sentence (the dative subject is a peculiarity of Indian languages). The accusative case denotes the object.

Clausal sentences: The clause constructions in both languages follow the same rule. The clauses are formed by nonfinite verbs. The clauses do not have free word order; they have fixed positions. The order of embedding of the subordinate clause is the same in both languages.

Ex:1
(Ma) [innale vanna(vbp) kutti]/Sub-RP-cl {sita annu}/Main-cl
(Ta) [neRu vanta(vbp) pon]/Sub-RP-cl {sita aakum}/Main-cl
     [Yesterday came girl]/Sub-cl {Sita is}/Main-cl
     (The girl who came yesterday is Sita.)

As can be seen from the above example, the basic syntactic structure is the same in both languages. The example is a two-clause sentence with a relative participial clause and a main clause, the relative participial clause being formed by the nonfinite verb (vbp). Using the same example we can observe the pronominal distribution.

Ex:2
(Ma) [innale vanna(vbp) avali (PRP)]/Sub-RP-cl {sitai annu}/Main-cl
(Ta) [neRu vanta(vbp) avali (PRP)]/Sub-RP-cl {sitai aakum}/Main-cl
     [Yesterday came shei (PRP)]/Sub-cl {Sitai is}/Main-cl
     (The she who came yesterday is Sita.)

In the above example the pronoun "aval" "she" occurs at the same position in both languages

and the antecedent "sita" also occurs at the same position, as shown by the co-indexing. Consider another example.

Ex:3
(Ma) sithaai kadaikku pooyi. avali pazham Vaangicchu(V,past)
(Ta) sithaai kadaikku cenRaal. avali pazham Vaangkinaal(V,past,+F,+Sg)
     Sita shop went. She fruit bought
     (Sitai went to the shop. Shei bought fruit.)

In the above example there are two sentences; the pronoun is in one sentence and the antecedent in the other. Here one can see the distribution of the pronoun "aval" and where the antecedent "sita" occurs. Tamil has number, gender and person agreement between subject and verb while Malayalam does not, so this cannot be used as a grammatical feature for identifying the antecedent of an anaphor. This grammatical variation, however, does not affect the identification of pronoun-antecedent relations. From the above examples we can see that the two languages have the same syntactic structure at the clause and sentence level. We exploit this similarity between the two languages to achieve our goal: using it, the language model of Tamil can be used to resolve pronouns in Malayalam.

3   Pronoun Resolution in Tamil

3.1 Pronouns in Tamil

In this section, we analyse the pronominal expressions in Tamil in detail. Pronouns are words used as substitutes for nouns that have already been mentioned or are already known. There are also pronouns which do not refer. Pronouns in Tamil have person (1st, 2nd, 3rd) and number (singular, plural) distinctions. Masculine, feminine and neuter gender distinctions are clearly marked in 3rd person pronouns, whereas 1st and 2nd person pronouns make no gender distinction. In this work we consider only third person pronouns. Third person pronouns in Tamil have inherent gender, as in English: "avan" he, "aval" she and "atu" it. The distribution of pronouns in various syntactic constructions is explained with examples below.

Ex:4
4a. maaNavarkaLi paLLikku celkiranar.
    Students(N) school(N)+dat go(V)+present+3pl
    (Students are going to the school.)
4b. avarkaLi veekamaaka natakkinranar.
    They(PN) fast(ADV) walk(V)+present+3pl
    (They are walking fast.)

Considering Ex.4a and Ex.4b, sentence Ex.4b has the 3rd person plural pronoun 'avarkaL' as its subject, and it refers to the plural noun 'maaNavarkaL', the subject of Ex.4a.

Ex:5
5a. raamuvumi giitavumj nanparkaL.
    Raamu(N)+INC Gita(N)+INC friends(N)
    (Ramu and Gita are friends.)
5b. avani ettaam vakuppil padikkiraan.
    He(PN) eight(N) class(N) study(V)+present+3sm
    (He studies in the eighth standard.)
5c. avaLumj ettaam vakuppil padikkiraaL.
    She(PN) eight(N) class(N) study(V)+present+3sf
    (She also studies in the eighth standard.)

In Ex.5b, the 3rd person masculine pronoun 'avan' occurs as the subject and refers to the masculine noun 'raamu', the subject noun in Ex.5a. Similarly, the 3rd person feminine pronoun 'avaL' in Ex.5c refers to the feminine noun 'giita' in Ex.5a. 'atu', which is a 3rd person neuter pronoun, can also occur with a genitive/possessive case marker.

3.1.1 Non-anaphoric Pronouns

Pronouns can also occur as generic mentions without a referent. In English this is known as 'pleonastic it'.

Ex:6
    atu oru malaikalam.
    It(PN) one(Qc) rainy_season(N)
    (It was a rainy season.)

In Ex.6, the 3rd person neuter pronoun 'atu' (it) does not have a referent. Here 'atu' is equivalent to the pleonastic 'it' in English.

3.1.2 Corpus Annotation

We collected 600 news articles from various online Tamil news wires. The articles are from the Sports, Disaster and General News domains. The anaphoric expressions are annotated along with their antecedents using PAlinkA, a highly customisable graphical tool for discourse annotation (Orasan, 2003), which we customized for Tamil. We used two tags, namely MARKABLE and COREF. The corpus used for training is 54,563 words, annotated for anaphora-antecedent pairs, and the testing

corpus is 10,912 words. We calculated the inter-annotator agreement, the degree of agreement among annotators, using Cohen's kappa as the agreement statistic. The kappa coefficient is generally regarded as the statistic of choice for measuring agreement on ratings made on a nominal scale. We obtained a kappa score of 0.87. The differences between the annotators were analysed, and the variation was found to occur in the marking of antecedents for pronominals; this is common in sentences with clausal inversion and genitive drop.

3.2 Pronoun Resolution System

Early works in anaphora resolution by Hobbs (1978), Carbonell and Brown (1988), Rich and LuperFoy (1988) etc. were knowledge-intensive approaches, in which syntactic and semantic information, world knowledge and case frames were used. Centering theory, a discourse-based approach to anaphora resolution, was presented by Grosz (1977) and Joshi and Kuhn (1979). Salience-feature-based approaches were presented by Lappin and Leass (1994), Kennedy and Boguraev (1996) and Sobha et al. (2000). Indicator-based, knowledge-poor methods for anaphora resolution were presented by Mitkov (1997, 1998). One of the early works using machine learning was Dagan and Itai's (1990) unsupervised approach based on word co-occurrence. With the adoption of machine learning techniques, researchers worked on pronominal anaphora resolution and noun phrase anaphora resolution simultaneously. Other machine learning approaches to anaphora resolution were the following: Aone and Bennett (1995), McCarthy and Lehnert (1995), Soon et al. (2001) and Ng and Cardie (2002) used decision-tree-based classifiers. Anaphora resolution using CRFs was presented by McCallum and Wellner (2003) for English, Li et al. (2008) for Chinese and Sobha et al. (2011, 2013) for English and Tamil.

For Indian languages, anaphora resolution engines have been demonstrated for only a few languages, such as Hindi, Bengali, Tamil and Malayalam. Most Indian languages do not have parsers and other sophisticated pre-processing tools. The earliest work on Indian languages, 'Vasisth', was a rule-based multilingual anaphora resolution platform by Sobha and Patnaik (1998, 2000, 2002), in which the morphological richness of Malayalam and Hindi was exploited without using a full parser; case marker information is used for identifying the subject, object, and direct and indirect objects. Prasad and Strube (2000), Uppalapu et al. (2009) and Dakwale et al. (2013) presented different approaches using Centering theory for Hindi. Sobha et al. (2007) presented a salience-factor-based approach with limited shallow parsing of the text. Akilandeswari et al. (2013) used CRFs for the resolution of third person pronouns. Ram et al. (2013) used Tree CRFs for anaphora resolution in Tamil with features from dependency-parsed text. Most published works considered the resolution of third person pronouns, and it is a non-trivial task.

The pronoun resolution engine identifies the antecedents of the pronouns. It is built using the Conditional Random Fields (CRFs) technique. Though CRFs are best known for sequence labelling tasks, we use the technique to classify the correct anaphor-antecedent pair among the possible candidate NP pairs, by presenting the features of each NP pair and avoiding the transition probability. During training we form positive pairs by pairing an anaphoric pronoun with its correct antecedent NP, and negative pairs by pairing anaphoric pronouns with the other NPs which match the pronoun in person, number and gender (PNG). These positive and negative pairs are fed to the CRFs engine and the language model is generated. During testing, when an anaphoric pronoun occurs in a sentence, the noun phrases which match the pronoun in PNG and occur in the preceding portion of the sentence or in the four preceding sentences are collected, paired with the anaphoric pronoun, and presented to the CRFs engine to identify the correct anaphor-antecedent pair.

3.2.1 Pre-processing

The input document is processed with a sentence splitter and tokeniser to split the document into sentences and the sentences into individual tokens, which include words, punctuation markers and symbols. The sentence-split and tokenized documents are then processed with the syntactic processing modules: a morphological analyser, a part-of-speech tagger and a chunker. These modules are developed in house.

a) Morphological Analyser: Morphological analysis decomposes each word into its component morphemes and assigns the correct morpho-syntactic information. The Tamil morphological analyser is developed using a paradigm-based approach and implemented using Finite State Automata (FSA) (Vijay Sundar et al., 2010). The words are classified according to their suffix formation and marked as paradigms. The number of paradigms used in this system is: noun

paradigms: 32; verb paradigms: 37; adjective paradigms: 4; adverb paradigms: 1. A root word dictionary with 1,52,590 root words is used for developing the morphological analyser. The analyser was tested with 12,923 words; the system processed 12,596 of them, of which it correctly tagged 12,305. The precision is 97.69% and the recall is 97.46%. The analyser returns all possible parses for a given word.

b) Part-of-Speech (POS) Tagger: The part-of-speech tagger disambiguates the multiple parses given by the morphological analyser using the context in which a word occurs. It is developed using the machine learning technique Conditional Random Fields (CRF++) (Sobha et al., 2010). The features used for learning are a set of linguistic suffix features along with statistical suffixes, with a window of 3 words. We used 4,50,000 words tagged with BIS POS tags. The system performs with 100% recall and an average precision of 95.16%.

c) Noun and Verb Phrase Chunker: Chunking is the task of grouping grammatically related words into chunks such as noun phrases, verb phrases, adjectival phrases etc. The system is developed using the machine learning technique Conditional Random Fields (CRF++) (Sobha et al., 2010). The features used are the POS tag, the word and a window of 5 words, and the training corpus is 74,000 words. The recall is 100% and the average precision is 92.00%.

3.3 Pronoun Resolution Engine

In both the training and testing phases, the noun phrases (NPs) which match the PNG of the pronoun are considered, and the features are extracted from these NPs. In the training phase, the positive and negative pairs are marked and fed to the ML engine to generate a language model. In the testing phase, these NPs with their features are input to the language model to identify the antecedent of a pronoun. We have not taken the lexical item, i.e. the word itself, as a feature; we use only the grammatical tags. The features selected represent the syntactic position of the anaphor-antecedent occurrence.

3.3.1 Feature Selection

The features required for machine learning are identified from the shallow-parsed input sentences. The features for all possible candidate antecedent-pronoun pairs are obtained by pre-processing the input sentences with the morphological analyser, POS tagger and chunker. The features identified can be classified as positional and syntactic features.

Positional Features: The position of the candidate antecedent is noted: in the same sentence as the pronoun, or in one of the four sentences preceding the current sentence.

Syntactic Features: The syntactic arguments of the candidate noun phrases in the sentence are a key feature. The arguments of the noun phrases, such as subject, object and indirect object, are obtained from the case suffix affixed to the noun phrase. As mentioned in Section 2, the subject of a sentence can be identified by the case marker it takes; we use the morphological markings for this. The features are:

a) the POS tag and chunk tag of the candidate NP, and the case marker tags of the noun;
b) the suffixes attached to the verb which show gender.

3.3.2 Development of the Tamil Language Model

We used 600 Tamil newspaper articles for building the language model. The preparation of the training data is as follows. The raw corpus is processed with the sentence splitter and tokeniser, and the tokenized corpus is then preprocessed with the shallow processing modules, namely the morphological analyser, part-of-speech tagger and chunker. The training data is prepared from this processed corpus. For each pronoun, the noun phrase (NP) preceding the pronoun and the NPs in the preceding sentences up to the correct antecedent NP which match it in person, number and gender (PNG) are selected for training. The above features are used for CRF learning. The system was evaluated with data from the web and the result is given below.

    Domain      Testing Corpus (Words)   Precision (%)   Recall (%)
    News data   10,912                   86.2            66.67

    Table 1: Pronominal Resolution (CRFs engine)

4   Resolution of Pronouns in a Malayalam Corpus Using the Tamil Language Model
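Before turning to the cross-lingual experiments, it may help to make the training-pair construction of Sections 3.2-3.3 concrete. The following is a minimal, hypothetical sketch (not the authors' CRF++ pipeline; the class and function names and the PNG codes are illustrative) of how candidate anaphor-antecedent pairs with purely grammatical-tag features could be generated:

```python
# Hypothetical sketch of the pair construction in Sections 3.2-3.3: each
# pronoun is paired with every NP that precedes it within the current
# sentence or the four preceding sentences and agrees with it in person,
# number and gender (PNG).  The pair containing the true antecedent is
# labelled positive; all other agreeing pairs are negative.  Only
# grammatical tags are used as features -- never the word form itself.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Chunk:
    chunk_id: int   # position of the chunk in the document
    sent_id: int    # index of the sentence containing it
    pos: str        # POS tag of the head word
    case: str       # case suffix tag: "nom", "dat", "acc", ...
    png: str        # person-number-gender code, e.g. "3sm", "3sf", "3pl"

def training_pairs(pronoun: Chunk, antecedent_id: int,
                   nps: List[Chunk], window: int = 4) -> List[Tuple[Dict, int]]:
    """Build (feature-dict, label) pairs for one anaphoric pronoun."""
    pairs = []
    for np in nps:
        dist = pronoun.sent_id - np.sent_id
        # keep only NPs that precede the pronoun within the sentence window
        if not (0 <= dist <= window) or np.chunk_id >= pronoun.chunk_id:
            continue
        # candidate must agree with the pronoun in PNG
        if np.png != pronoun.png:
            continue
        features = {
            "cand_pos": np.pos,      # grammatical tags only:
            "cand_case": np.case,    # the lexical item is never a feature
            "sent_dist": dist,
            "png": pronoun.png,
        }
        pairs.append((features, 1 if np.chunk_id == antecedent_id else 0))
    return pairs
```

In the paper these pairs are presented to the CRFs engine (CRF++) with the transition probability suppressed; the binary labels above stand in for that classification step.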

In this section, we present in detail how Malayalam is tested using the Tamil language model. The Malayalam data is pre-processed to meet the requirements of the Tamil language model's test data. The test data requires four kinds of grammatical information: i) the POS, ii) the case marker, iii) the number, gender and person, and iv) the chunk information. In the introduction we asked three questions on how a language model for a source language can be used for testing a target language. The three questions are dealt with one by one below.

1. Can the language model developed for one language be used for analyzing the other language?

In this study we used the language model of the source language, Tamil. The features used to develop this language model are the POS tag, the chunk tag and the case/suffix tags. The word form was not considered as a feature. Since these are the features used for learning, the test data must also contain this information. As said earlier, Malayalam is not a resource-rich language, and it does not have high-accuracy pre-processing engines such as a POS tagger, chunker or morphological analyser. Hence we developed rudimentary preprocessing systems which can provide the POS tags, case/suffix tags and chunk tags.

The POS information: The POS and suffix information are assigned to the corpus using a root word dictionary with part-of-speech information together with a suffix dictionary. The root word dictionary, which includes all types of pronouns, contains nearly 66,000 root words. The grammatical information (POS), such as noun, verb, adjective and pronoun, and the person, number and gender (PNG) information are given for each word in the dictionary.

The suffix dictionary contains all possible suffixes which a root word can take, including the sandhi changes that occur when a suffix is added to the root word. The suffixes are of two types: i) those which attach to nouns, called case suffixes, and ii) those which attach to verbs, called TAM (Tense, Aspect and Modal) suffixes. Using this suffix dictionary we can identify the POS of a word even if the word is not present in the root word dictionary. The suffix dictionary has 1,00,000 unique entries.

The noun chunk information is given by a rule-based chunker which works on three linguistic rules; only the noun phrases are required for anaphora resolution. Chunks are identified using the basic linguistic rules for noun phrases given below:

1. [determiner] [quantifier] [intensifier] [classifier] [adjective] {Head Noun}, where the head noun is obligatory and all the others are optional;
2. an NN+NN combination;
3. NN with no suffix + NN with no suffix ... + NN with or without a suffix.

Using the above rules we identified the noun chunks in Malayalam. The pre-processing discussed above gives an accuracy of 66% for POS and suffix tagging and 63% for chunking.

2. How can the difference in lexical form be resolved when using the language model?

The second question concerns the lexical forms, or words, which are not similar in the two languages, and how this can be resolved. The analysis of Tamil has shown that the syntactic structure of the language has more prominence than the words in the resolution of anaphors. Hence we took the syntactic features and did not take the word as a feature; the system learned only the structural patterns.

3. How can the challenges of script variation be overcome?

Since the word is not taken as a feature, the script does not pose any challenge. Still, to have the same script we converted the two languages into one form, the WX notation. This gave the two languages the same representation.

4. Testing with Malayalam Data

We selected 300 articles from Malayalam Wikipedia, of different genres and sizes. The 300 Malayalam documents from Wikipedia contain 7,600 sentences with 3,660 3rd person pronouns. The distribution of the pronouns is presented in Table 2.

    Pronoun                                 Number of Occurrences
                                            with Inflected Forms
    avan (3rd person masculine singular)    1120
    aval (3rd person feminine singular)      840
    avar (3rd person honorific)              420
    athu (3rd person neuter singular)       1280
    Total                                   3660

    Table 2: Distribution of 3rd person pronouns in the corpus

As discussed in the earlier section, the pre-processing done for Tamil using the syntactic modules provides morphological, POS and chunking information. Hence the same pre-processing information is necessary for the Malayalam data as
                                                          formation is necessary for Malayalam data as

                                                        616
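The suffix-dictionary lookup described above can be sketched as follows. This is a minimal illustration and not the authors' implementation; the example suffixes and tags are hypothetical stand-ins for the 1,00,000-entry dictionary.

```python
# Minimal sketch of POS guessing via a suffix dictionary for an
# agglutinative language. Case suffixes imply a noun; TAM suffixes
# imply a verb. The entries below are hypothetical examples.
SUFFIX_DICT = {
    "kku": "NOUN",   # dative case suffix (attaches to nouns)
    "aal": "NOUN",   # instrumental case suffix
    "thu": "VERB",   # past-tense TAM suffix (attaches to verbs)
}

def guess_pos(word: str, root_dict: dict) -> str:
    """Return the POS of `word`, falling back to suffix matching
    when the word is absent from the root-word dictionary."""
    if word in root_dict:
        return root_dict[word]
    # Try longer suffixes first so the most specific match wins.
    for n in range(len(word) - 1, 0, -1):
        suffix = word[-n:]
        if suffix in SUFFIX_DICT:
            return SUFFIX_DICT[suffix]
    return "UNK"

# An out-of-vocabulary word ending in the dative suffix is tagged NOUN.
print(guess_pos("jolikkarkku", {}))  # NOUN
```

Matching the longest suffix first keeps a short suffix from shadowing a longer, more specific one that includes it.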
The documents are initially preprocessed with a sentence splitter and a tokenizer. The tokenized documents are then processed with the POS, suffix and chunking systems discussed above to enrich the text with syntactic information.
For each pronoun, we identify the possible candidate antecedents. The noun phrases that occur before the pronoun in the current sentence and in the preceding four sentences, and that match the pronoun in person, number and gender (PNG), are identified as possible candidates. For these candidates we extract the features required for the CRFs technique, as explained in the previous section. After the features for the selected candidate antecedents are extracted, the antecedent is identified using the language model built from the Tamil data. The results are encouraging, with 67% accuracy, which is a respectable score. The errors and the evaluation are discussed in detail in the next section.

5. Experiment and Discussion

The experiment showed that the resolution of the pronouns "avan" (he) and "aval" (she) is similar to that in the Tamil documents. The issue of split antecedents is not addressed, and hence pronouns referring to coordinated nouns were not resolved. The pronoun resolved least well, compared to the others, is the third person neuter pronoun "atu". The third person neuter pronoun usually has a larger number of possible candidates, which leads to poorer resolution. Consider the following example.

Ex:7
avan     joli     ceytha          jolikkarkku
He(PRP)  work(N)  do(V)+past+RP   worker(N)+pl+dat
vellam   kotuthu.
water(N) give(V)+past
(He gave water to the workers who did the work.)

atu      nallatu   aayirunnu
It(PRP)  good(N)   is(Copula V)+past
(It was good.)

In this example, the pronoun 'atu' refers to 'vellam' ('water') in the previous sentence. In the previous sentence there are two possible candidate antecedents, 'vellam' and 'joli', which are 3rd person nouns. The ML engine chooses 'joli', which is in the initial position of the sentence and in the subject position. When the antecedent is in the object position, the engine has not identified it properly. The following table gives the results for the pronouns.

    Type of pronoun                              Precision (%)   Recall (%)
    avan (3rd person masculine singular)         70.83           69.52
    avaL (3rd person feminine singular)          69.34           68.56
    avar/avarkal (3rd person plural/honorific)   65.45           70.34
    atu (3rd person neuter singular)             56.67           65.67
    Total                                        68.45           67.34
    Table 3: Evaluation Results

6. Conclusion

In this paper we explained a method that uses a high-resource language to resolve anaphors in a less-resourced language. In this experiment the high-resource language is Tamil and the less-resourced language is Malayalam. The results are encouraging. The model needs to be tested with more data as future work.
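The candidate-selection step described in Section 4, which keeps noun phrases that precede the pronoun within the current sentence plus the four preceding sentences and agree with it in PNG, can be sketched as follows. The data representation, helper names and PNG values are assumptions for illustration, not the authors' code.

```python
# Sketch of candidate-antecedent selection: collect noun phrases that
# precede the pronoun within a window of the current sentence plus the
# four preceding sentences, keeping those that agree in person, number
# and gender (PNG). The NP representation here is hypothetical.
from dataclasses import dataclass

@dataclass
class NP:
    text: str
    sent_idx: int   # index of the sentence containing the NP
    tok_idx: int    # token position within the sentence
    png: tuple      # (person, number, gender), e.g. ("3", "sg", "n")

def candidate_antecedents(pronoun: NP, nps: list, window: int = 4) -> list:
    """Return NPs preceding `pronoun` in the current sentence or the
    previous `window` sentences, with matching PNG features."""
    out = []
    for np in nps:
        in_window = pronoun.sent_idx - window <= np.sent_idx <= pronoun.sent_idx
        precedes = (np.sent_idx < pronoun.sent_idx
                    or (np.sent_idx == pronoun.sent_idx
                        and np.tok_idx < pronoun.tok_idx))
        if in_window and precedes and np.png == pronoun.png:
            out.append(np)
    return out

# As in Ex:7, both 'joli' and 'vellam' are 3rd person neuter singular,
# so both survive the PNG filter for the pronoun 'atu'; the CRF model
# must then rank them.
nps = [NP("joli", 0, 1, ("3", "sg", "n")), NP("vellam", 0, 4, ("3", "sg", "n"))]
atu = NP("atu", 1, 0, ("3", "sg", "n"))
print([c.text for c in candidate_antecedents(atu, nps)])  # ['joli', 'vellam']
```

The example also shows why "atu" is hard: PNG agreement alone cannot separate the two neuter candidates, which is left to the learned model.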
