How Cute is Pikachu? Gathering and Ranking Pokémon Properties from Data with Pokémon Word Embeddings
Mika Hämäläinen, Khalid Alnajjar and Niko Partanen
Department of Digital Humanities
University of Helsinki
first.lastname@helsinki.fi

arXiv:2108.09546v1 [cs.CL] 21 Aug 2021

This is an English translation of the original paper published in Finnish: Hämäläinen, M., Alnajjar, K. & Partanen, N. (2021). Nettikorpuksen avulla tuotettuja sanavektorimalleja Pokémonien ominaisuuksien kuvaamiseksi. In Saarikivi, T. & Saarikivi, J. (eds.) Turhan tiedon kirja – Tutkimuksista pois jätettyjä sivuja. p. 199-214. SKS Kirjat.

Abstract

We present different methods for obtaining descriptive properties automatically for the 151 original Pokémon. We train several different word embeddings models on a crawled Pokémon corpus, and use them to automatically rank English adjectives based on how characteristic they are of a given Pokémon. Based on our experiments, it is better to train a model with domain-specific data than to use a pretrained model. Word2Vec produces less noise in the results than the fastText model. Furthermore, we expand the list of properties for each Pokémon automatically. However, none of the methods is spot on and there is a considerable amount of noise in the different semantic models. Our models have been released on Zenodo.

1 Introduction

Using knowledge-bases that contain properties typical for nouns has been at the heart of computational creativity research for a long time. Such data has proven itself useful when generating a variety of different types of creative language such as metaphors (Veale and Hao, 2007), poems (Hämäläinen, 2018) or riddles (Ritchie, 2003).

In this paper, we present a novel approach for constructing such a knowledge-base automatically for the 151 original Pokémon. Our approach is applicable in scenarios with a limited amount of data available. The resulting knowledge-base can be used in the future for generating creative language based on Pokémon such as similes and metaphors (e.g. cute as a Pikachu or confused as a Psyduck). We have made the Pokémon word embeddings models (https://zenodo.org/record/4554478) freely available on Zenodo together with the Pokémon story corpus (https://zenodo.org/record/4552785).

Pokémon has been a topic of research in the past (Salter et al., 2019; Geissler et al., 2020; Vaterlaus et al., 2019). It has nonetheless eluded any widespread NLP research interest, even though Pokémon names are surprisingly problematic for current NLP methods, as we will show in this paper.

Stereotypical knowledge has been successfully extracted in the past (Veale and Hao, 2008). Their method relied on using the Google search API to mine stereotypical adjective-noun relations with an "AS adjective AS [a/an] NOUN" query. However, such a method requires a lot of data in order to work, and in our experience using such a query on a reasonably sized corpus yields hardly any results.

For proper nouns, or more precisely famous characters, the simplest approach for building such a knowledge-base has been manual annotation, as in the case of the Non-Official Characterization (NOC) list (Veale, 2016). While the NOC list is a valuable resource for properties of famous characters, we are looking into a more automated method for producing a similar knowledge-base for Pokémon.

There has been an automated effort for expanding the properties recorded in the NOC list (Alnajjar et al., 2017). While this method is a step in the desired direction in the sense that it does not require the nouns to exist in a massive corpus, it still relies on mined associations between adjectival properties and a hand-annotated list of properties for famous characters in order to expand them further.

In our approach, we propose a method for extracting properties for Pokémon automatically
from a very small corpus. Furthermore, we use a larger Pokémon-specific corpus to automatically rank the extracted properties so that a higher rank is given to the properties that are most descriptive of a given Pokémon.

2 Data and Preprocessing

In order to gather properties for each Pokémon, we look into Wikidata (https://www.wikidata.org/). While Wikidata does not contain descriptive properties, it provides us with unambiguous links to Giantbomb (https://www.giantbomb.com/) entries. We use the Wikidata entry list of Pokémon introduced in Generation I (https://www.wikidata.org/wiki/Q3245450) to obtain these Giantbomb links. Giantbomb is a website listing information on video game characters. Unlike resources such as Bulbapedia (https://bulbapedia.bulbagarden.net/), it provides a concise description including useful information such as characteristics and physical abilities without going too deep into the use of the Pokémon in video games. This data, however, is not structured but rather a free-form textual description. It constitutes our small Pokémon description corpus.

In order to rank Pokémon properties, we crawl a larger corpus of texts written about Pokémon. Many Wikipedia-like sources are too neutral to reveal anything meaningful about Pokémon, and Pokédex entries are usually too short and non-descriptive for our needs. Subtitles from the Pokémon TV show come with their own problem of audio-visual grounding of the text. Fortunately, we found a great resource of stories authored by Pokémon fans called Fanfiction (https://www.fanfiction.net/).

The evident problem of the resource is that many of the stories are poorly written, and that there are stories written in multiple languages. To mitigate this, we use the search functionality of the service to find stories by the query pokemon that are in English and have at least 10k words. This results in 8,011 fan-authored stories about Pokémon. We crawl only the stories that meet these criteria. This forms our bigger Pokémon stories corpus, which we process by doing sentence and word tokenization with NLTK (Bird et al., 2009).

3 Extracting Pokémon Properties

We experiment with multiple ways of extracting the properties for each Pokémon. In the first method, we use a TF-IDF (term frequency–inverse document frequency) based method for extracting and ranking Pokémon properties on the description corpus. We compare the results of the TF-IDF method to different methods using semantic relatedness and similarity word embeddings models. For semantic relatedness we build a log-likelihood matrix of term-to-term relations based on their co-occurrences following the implementation of Meta4Meaning (Xiao et al., 2016), and for semantic similarities we utilize word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2016) models. We test out all the methods with generic pretrained models and with domain-specific models trained on the story corpus to see how big a difference a domain-specific corpus makes for the task of automatic extraction of properties.

We collect an initial set of adjectival properties for each Pokémon from the Pokémon description corpus by processing it using spaCy (Honnibal and Johnson, 2015) and retaining adjectives appearing in the descriptions of the Pokémon. This step yields a short unranked list of adjectival properties that are used to describe the Pokémon. However, it also includes very generic adjectives such as original, and in some cases it might find no adjectives at all due to very short descriptions. As an example, the properties collected for Pikachu included: {electric, petite, close, cute, yellow, high, ..., first, electrical}.

Next, we investigate methods for ranking and expanding the properties of each Pokémon. The first method makes use of TF-IDF: we build the TF-IDF matrix from the Pokémon description corpus by treating Pokémon as documents and their descriptions as features using Scikit-learn (Pedregosa et al., 2011). The intuition here is that TF-IDF would capture the importance of each feature to a Pokémon. As a result, this gives us a list of words for each Pokémon together with their strength of importance to the Pokémon. This is a very simplistic way of ranking the Pokémon properties without using the larger story corpus. Using the importance scores returned by TF-IDF to rank the properties retrieved in the previous step, we get the following ranked properties for Pikachu: {lovable, onomatopoetic, prolific, stubborn, superlative, unbeknownst, ..., -, 15th}.

In the following steps, we rank the collected adjectival properties using semantic relatedness and similarity word embeddings models. For each method, we test out two versions, one that is pretrained on generic text such as Common Crawl and Wikipedia, and another that is trained on the Pokémon stories corpus.

Pokémon | TF-IDF | Pokémon fastText | Pre-trained fastText | Pokémon Word2Vec | Pokémon Relatedness
--- | --- | --- | --- | --- | ---
Parasect | back, big, dark, lower, parasite | parasitic, poisonful, sapping, crab-like, poison | QF, Oz, EP, XL, foe | sapping, crab-like, Polish, poison, sapped | scuttled, solar, evolved, sent, male
Omanyte | full, twisted, pokemon, original, strange | fossil, Oman, Omani, fossil-like, crab-like | JV, EP, tapu, mi, zoid | beached, fossil, crab-like, dorsal, evolved | fossil, caught, scald, level, prehistoric
Horsea | pokemon, original, powerful | bubble, squirtish, high-current, splashing, swime | QF, ray, zoid, animé, peaty | bubble, beached, high-pressured, dorsal, scald | bubble, caught, evolved, level, swimming
Arcanine | pure, true, mysterious, select, majestic | lubric, whinny, mane, canine, dismounted | EP, XL, JV, pi, glew | whinny, earth-shaking, orange-yellow, high-pressured, scald | back, canine, large, sent, male
Abra | original, psychic | disable, Mole, Chinglish, psychic, Minimite | Oz, ex, D., EP, Ona | hypnotic, dinged, sapping, psychic, evolved | psychic, teleporting, side, evolved, cast
Seaking | prominent, pokemon, original | beached, high-current, swime, dorsal, hydro | QF, JV, EP, A1, zoid | beached, tidal, high-pressured, seismic, dorsal | released, trapped, swimming, sent, causing
Jolteon | smallest, negative, sad, shortest, startled | mane, bristled, crackled, wagging, veed | QF, EP, XL, JV, pi | whinny, supercharged, high-pressured, pi, wagging | evolved, electric, spiky, male, female
Magmar | fiery, pokemon, original, smaller, intense | fire-hot, knock-on, punch, seismic, scald | foe, Oz, EP, XL, zoid | five-pointed, high-pressured, seismic, scald, hydro | punch, fiery, sent, flame, causing
Pidgeot | beautiful, top, wide, thick, unsuspecting | bat-wing, cawing, flappish, preened, flapped | QF, EP, XL, zoid, glew | cawing, flapped, seismic, lightning-quick, roosting | back, flapped, evolved, flapping, landed

Table 1: Top 5 adjectives produced by different methods for 9 randomly selected Pokémon.

Pokémon TF-IDF Pokémon fastText Pre-trained fastText Pokémon word2Vec Pokémon Relatedness
dangerous, knowledgeable, beautiful, dreamy, amusing, funny, surprising, public, versatile, despicable, Drowzee intelligent, ruthless, twisted raw, alluring, sensuous charming, relaxed specified, needed
light, inconspicuous, fresh, handsome, rugged, inflexible, fixed, boring, soulful, grandiose, expressive, Magnemite insignificant, memorable individual stolid, unchanging exciting, urgent
dismayed, amazed, grandiose, funky, twisty, grandiose, funky, twisty, sturdy, potent, raw, Raichu horrified, outraged, surprised crazed, exciting crazed, exciting versatile, wealthy
loud, dangerous, bitter, divisive, alive, frustrated, disappointed, Beedrill clear, deadly, slick deadly, vulgar bitter, scared, shocked
beautiful, creative, innovative, satisfying, healthy, safe, scary, inhuman, cunning, Exeggcute varied, diverse delicious, tasty brutal, mean
creative, innovative, fresh, harmful, dangerous, deadly, harmful, dangerous, deadly, dangerous, public, specified, Weezing memorable, quirky slick, lethal slick, lethal needed, slick
beautiful, professional, intelligent, crafty, clever, funny, dominant, raised, identifying, Meowth cunning, brutal expressive, versatile well-meaning, treacherous normal, known
beautiful, shiny, crafty, clever, funny, dominant, raised, identifying, evocative, natural, fallible, Ninetales round, passionate, merry well-meaning, treacherous normal, known alive, feminine
scary, dangerous, funny, fluid, dangerous, detestable, dangerous, potent, odious, dominant, feminine, identifying, Arbok intense, ruthless unpredictable, totalitarian slick, carcinogenic damp, busted

Table 2: Expanded properties for 9 randomly selected Pokémon.

We follow the approach described by Xiao et al. (2016) to build a relatedness matrix by obtaining co-occurrences and then computing the simple log-likelihood as a measurement of relatedness between two words based on their individual frequencies and their observed and expected co-occurrences in the corpus. We use the ukWaC corpus (Ferraresi et al., 2008) as the generic corpus and build two relatedness models, one using the generic corpus and one using the Pokémon stories corpus. It appears that none of the Pokémon got captured in the generic model except for two, Persian and Ditto, because these words carry other meanings in the real world. Ranking Pikachu's properties using the domain-specific model results in: {electric, yellow, electrical, female, quick, powerful, ..., exclusive, maximum}.

We use word2vec and fastText as the semantic similarity word embeddings models. We use a skip-gram model with the default hyperparameters for both fastText and word2vec. Our word embeddings method takes a list of properties (adjectives) and compares the vector of each property against the vector of each Pokémon by a dot product. The more similar the property is to a Pokémon, the higher it ranks. As the pretrained word2vec and fastText models, we use the models provided by Kutuzov et al. (2017) (http://vectors.nlpl.eu/repository/20/3.zip) and Mikolov et al. (2018), respectively. For our Pokémon-specific models, we utilize Gensim (Řehůřek and Sojka, 2010) to train the word2vec model and the official fastText library (Bojanowski et al., 2017) to build the fastText model from the Pokémon stories corpus.

Similarly to the generic relatedness model, Pokémon names did not appear in the pretrained
word2vec model. Nonetheless, thanks to fastText's ability to use subword information during the training phase, the pretrained fastText model was able to produce semantic similarities between Pokémon and adjectival properties. Sorting Pikachu's properties using the pretrained fastText and the Pokémon-specific word2vec and fastText models gives:

fastText (pretrained): {cute, chuchu, red, -, evil, yellow, japanese, ..., tumultuous, non},
word2vec (Pokémon): {electric, chuchu, -, electrical, quick, yellow, cute, ..., capable, prominent},
fastText (Pokémon): {electric, chuchu, electrical, cute, yellow, close, ..., prolific, 15th}.

In order to extract a ranked list of properties for each Pokémon from the word embedding models, we compute the similarity between each Pokémon and every single adjective in the Oxford English Dictionary (OED, https://www.oed.com/) and sort these words (properties) based on their similarity with each Pokémon.

Furthermore, we experiment with an existing method for expanding properties for the results of each method. The property expansion is based on the data and algorithm presented by Alnajjar et al. (2017). The method takes in a list of properties and produces an extended property list by using Thesaurus Rex data (Veale and Li, 2013). We use this method to predict more properties by feeding in the top 10 adjectives produced by each model.

4 Results

Table 1 shows results for different Pokémon by the different methods. The table shows results for the word embeddings models when using adjectives from the OED. The pretrained Word2Vec model and the generic relatedness model are missing from the table as they did not produce any results at all for any Pokémon. The cells in bold have the highest number of descriptive adjectives.

We can see that the pre-trained fastText model does not capture the semantics of any Pokémon at all. All in all, fastText seems to produce good adjectives in the top results, but it clearly struggles with out-of-vocabulary adjectives. Instead of not returning a vector for them at all, and thus ignoring them, it has been designed to return vectors based on the character-level similarity of the word. For this reason, Oman and Omani, words that did not occur in the training corpus, get highly associated with Omanyte, as their character distance is low. For water Pokémon, swime (OED: "Used vaguely (like the noun) in Destr. Troy = giddy, dazed, and (actively) stunning.") gets a high score mostly due to the fact that it is close to the word swim.

Throughout the results, we can see that the obscurity of some of the adjectives in the OED confuses the models. Better results could be achieved if the list of adjectives was obtained from a corpus instead of a comprehensive dictionary that also records historical, obsolete and dialectal words.

It is very difficult to pick the overall best model for the task, as each of them works better for certain Pokémon than the others do. We can, however, gather that word embedding models that are trained on a domain-specific corpus work better than using TF-IDF to extract terms from short documents or using a pretrained model. Word2Vec seems to produce less noise than fastText.

In Table 2, we can see the resulting top 5 new properties produced by the automatic expansion of properties based on the lists of the top 10 properties produced by each method. None of the results of extended properties for Beedrill, Exeggcute and Raichu were descriptive enough to be highlighted as the best result. All in all, the expanded properties are very poor at describing each individual Pokémon. Based on these results, we cannot recommend using an automatic property expansion for Pokémon as it seems to favor properties typical for people. The method also failed to expand some of the properties for some models, and all of the properties for the pre-trained fastText model.

5 Conclusions

In this paper, we have presented our initial approaches in mining properties for Pokémon characters. The results look promising, although they reveal problems in the semantic representations of word embedding models, especially in pre-trained ones that belong to a different domain of text. The task of automatically extracting meaningful properties is far from trivial and calls for more future work. Nonetheless, our approach is a step away from expert-annotated data towards a fully automatic methodology.

The journey has just begun, so in the future different experiments could be conducted in terms of what kind of adjectives are used to query the word embedding models for each Pokémon. Also, a hybrid approach could be taken to combine the
strengths of each individual model; the more models point towards a certain property, the more likely it is to be descriptive of a given Pokémon.

Based on our research, we can conclude that the pretrained models do not work with Pokémon at all. Clearly, Pokémon itself is by no means so deviant a phenomenon that it could not be modeled with word embeddings. The problem we can see is part of a wider phenomenon that has received little attention in the field of NLP. If pretrained models, which are constantly used in various NLP studies, are not able to describe Pokémon, what other phenomena might they describe equally poorly? In general, our discipline does not pay very much attention to how well computational models work when applied to a completely new context.

The embeddings trained in this paper may be useful in a variety of different computational creativity tasks relating to Pokémon. Therefore we have released the models and the code freely on Zenodo (links on the first page of this paper).

References

Khalid Alnajjar, Mika Hämäläinen, Hanyang Chen, and Hannu Toivonen. 2017. Expanding and weighting stereotypical properties of human characters for linguistic creativity. In ICCC, pages 25–32.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python, 1st edition. O'Reilly Media, Inc.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukwac, a very large web-derived corpus of english. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google, pages 47–54.

Dominique Geissler, Elisa Nguyen, Daphne Theodorakopoulos, and Lorenzo Gatti. 2020. Pokérator-unveil your inner pokémon. In Proceedings of the Eleventh International Conference on Computational Creativity.

Mika Hämäläinen. 2018. Harnessing nlg to create finnish poetry automatically. In International Conference on Computational Creativity, pages 9–15. Association for Computational Creativity (ACC).

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.

Andrei Kutuzov, Murhaf Fares, Stephan Oepen, and Erik Velldal. 2017. Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 58th Conference on Simulation and Modelling, pages 271–276. Linköping University Electronic Press.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Graeme Ritchie. 2003. The jape riddle generator: technical specification. Institute for Communicating and Collaborative Systems.

Anastasia Salter, Mel Stanfill, and Anne Sullivan. 2019. But does pikachu love you? reproductive labor in casual and hardcore In Proceedings of the 14th International Conference on the Foundations of Digital Games, New York, NY, USA. Association for Computing Machinery.

J Mitchell Vaterlaus, Kala Frantz, and Tracey Robecker. 2019. "reliving my childhood dream of being a pokémon trainer": An exploratory study of college student uses and gratifications related to pokémon go. International Journal of Human–Computer Interaction, 35(7):596–604.

Tony Veale. 2016. Round up the usual suspects: Knowledge-based metaphor generation. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 34–41.

Tony Veale and Yanfen Hao. 2007. Comprehending and generating apt metaphors: a web-driven, case-based approach to figurative language. In AAAI, volume 2007, pages 1471–1476.

Tony Veale and Yanfen Hao. 2008. Enriching wordnet with folk knowledge and stereotypes. In Proceedings of the 4th Global WordNet Conference, Szeged, Hungary.

Tony Veale and Guofu Li. 2013. Creating similarity: Lateral thinking for vertical similarity judgments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 660–670.

Ping Xiao, Khalid Alnajjar, Mark Granroth-Wilding, Kat Agres, and Hannu Toivonen. 2016. Meta4meaning: Automatic metaphor interpretation using corpus-derived word associations. In Proceedings of the 7th International Conference on Computational Creativity (ICCC). Paris, France.
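Illustrative sketch: TF-IDF ranking of adjectives (Section 3). The paper builds its TF-IDF matrix with Scikit-learn. The block below is not that code but a hand-rolled, standard-library sketch of the same idea, in which each Pokémon is a document and the adjectives already extracted for it are ranked by their TF-IDF weight. The function name `tfidf_rank`, the toy descriptions and the candidate lists are all assumptions made up for illustration. The smoothed IDF form is one common choice and may differ from the weighting the authors actually used.

```python
import math
from collections import Counter


def tfidf_rank(descriptions, candidates):
    """Rank each Pokémon's candidate adjectives by TF-IDF.

    descriptions: dict mapping a Pokémon name to its tokenized description.
    candidates:   dict mapping a Pokémon name to the adjectives found in it.
    Both structures are hypothetical; the paper does not specify them.
    """
    n_docs = len(descriptions)
    # Document frequency: number of descriptions that contain each token.
    df = Counter()
    for tokens in descriptions.values():
        df.update(set(tokens))

    ranked = {}
    for name, tokens in descriptions.items():
        tf = Counter(tokens)
        scores = {}
        for adj in candidates.get(name, []):
            if tf[adj] == 0:
                continue  # adjective does not occur in this description
            # Smoothed inverse document frequency (an assumed variant).
            idf = math.log((1 + n_docs) / (1 + df[adj])) + 1
            scores[adj] = tf[adj] * idf
        # Highest score first; ties are broken alphabetically.
        ranked[name] = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked


# Toy data, invented for this example only.
toy_descriptions = {
    "pikachu": "a cute yellow electric mouse with a cute tail".split(),
    "raichu": "a large orange electric mouse".split(),
    "psyduck": "a confused yellow duck".split(),
}
toy_candidates = {
    "pikachu": ["cute", "yellow", "electric"],
    "raichu": ["large", "electric"],
    "psyduck": ["confused", "yellow"],
}

print(tfidf_rank(toy_descriptions, toy_candidates))
```

In this toy setting, adjectives that occur in only one description outrank those shared between several Pokémon, which is the intuition the paper gives for using TF-IDF on the description corpus.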
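Illustrative sketch: ranking adjectives by dot product (Section 3). The word embeddings method compares the vector of each adjective against the vector of a Pokémon by a dot product and ranks the adjectives by that score. The sketch below shows only this ranking step with plain Python lists. The three-dimensional vectors are invented toy values, and `rank_adjectives` is a hypothetical helper name, not part of Gensim or fastText. With a real model, the vectors would be looked up from the trained embeddings instead.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_adjectives(pokemon_vec, adjective_vecs):
    """Return adjectives sorted by dot product with the Pokémon vector,
    most similar first."""
    return sorted(
        adjective_vecs,
        key=lambda adj: dot(pokemon_vec, adjective_vecs[adj]),
        reverse=True,
    )


# Toy vectors, invented for illustration; real vectors would come from
# a trained word2vec or fastText model.
pikachu_vec = [0.9, 0.1, 0.3]
toy_adjectives = {
    "poisonous": [0.0, 0.9, 0.1],
    "cute": [0.5, 0.2, 0.6],
    "electric": [0.8, 0.0, 0.2],
}

print(rank_adjectives(pikachu_vec, toy_adjectives))
```

A raw dot product is sensitive to vector length. Normalizing the vectors first would turn the score into cosine similarity, but the paper only mentions a dot product, so the sketch stays with that.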
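Illustrative sketch: a simple log-likelihood relatedness score (Section 3). The relatedness model scores two words from their individual frequencies and their observed and expected co-occurrences. The exact formula is not given in the paper, so the block below assumes one plausible reading: co-occurrence is counted per sentence, the expected count under independence is f(a) * f(b) / N, and the score is log(observed / expected). The function name `relatedness` and the toy sentences are made up for illustration, and the Meta4Meaning implementation may differ in windowing, smoothing and the formula itself.

```python
import math
from collections import Counter
from itertools import combinations


def relatedness(sentences):
    """Score word pairs by log(observed / expected) co-occurrence.

    sentences: list of tokenized sentences. Co-occurrence is counted once
    per sentence, which is an assumption of this sketch.
    """
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words
    for tokens in sentences:
        unique = sorted(set(tokens))
        word_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    n = len(sentences)
    scores = {}
    for (a, b), observed in pair_freq.items():
        # Expected number of shared sentences if a and b were independent.
        expected = word_freq[a] * word_freq[b] / n
        scores[(a, b)] = math.log(observed / expected)
    return scores


# Toy corpus, invented for this example only.
toy_sentences = [
    ["pikachu", "electric"],
    ["pikachu", "electric", "yellow"],
    ["psyduck", "confused"],
    ["psyduck", "yellow"],
]

toy_scores = relatedness(toy_sentences)
print(toy_scores[("electric", "pikachu")], toy_scores[("pikachu", "yellow")])
```

Pairs that co-occur more often than chance get a positive score, and pairs that co-occur about as often as chance score near zero. Keys are alphabetically ordered word pairs, so ("electric", "pikachu") is the key for that pair regardless of the order in the sentence.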
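Illustrative sketch: a hybrid voting approach (Section 5). The conclusions suggest combining the models so that a property supported by more models is considered more likely to be descriptive. The paper does not implement this. The block below is only one possible reading, a simple vote count over the top lists of several models, with `vote` and `min_votes` as hypothetical names. The input lists are the Omanyte adjectives as read from Table 1 for the three Pokémon-trained models.

```python
from collections import Counter


def vote(top_lists, min_votes=2):
    """Keep adjectives proposed by at least `min_votes` models.

    top_lists: dict mapping a model name to its top adjectives.
    Returns adjectives sorted by number of votes, then alphabetically.
    """
    votes = Counter()
    for adjectives in top_lists.values():
        votes.update(set(adjectives))  # one vote per model
    kept = [adj for adj, count in votes.items() if count >= min_votes]
    return sorted(kept, key=lambda adj: (-votes[adj], adj))


# Omanyte row of Table 1 (Pokémon-trained models only).
omanyte_top5 = {
    "fastText": ["fossil", "Oman", "Omani", "fossil-like", "crab-like"],
    "word2vec": ["beached", "fossil", "crab-like", "dorsal", "evolved"],
    "relatedness": ["fossil", "caught", "scald", "level", "prehistoric"],
}

print(vote(omanyte_top5))
```

On this row the vote keeps fossil (three models) and crab-like (two models) and drops the out-of-vocabulary artifacts Oman and Omani, which only fastText proposes.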