Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers

Mika Hämäläinen, Faculty of Arts, University of Helsinki, mika.hamalainen@helsinki.fi
Khalid Alnajjar, Faculty of Arts, University of Helsinki, khalid.alnajjar@helsinki.fi

Abstract

We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5-point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions that are as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.

1 Introduction

Human evaluation in natural language generation (NLG) has become a hot topic lately, with the emergence of several survey papers that study how human evaluation has been conducted in the field of NLG in general (Howcroft et al., 2020; Belz et al., 2020). This has led to several recent evaluation frameworks for evaluating the output of NLG systems (Liu et al., 2020; Gehrmann et al., 2021).

However, not all natural language generation tasks are designed to convey factual information. Some NLG tasks deal with producing text of an aesthetic nature such as poetry, stories, humor and so on. We call these creative NLG tasks. These types of tasks are simultaneously researched in two distinct fields of science: natural language processing (NLP) and computational creativity (CC). Existing survey papers have only focused on NLP research, and they have not made a distinction between creative and non-creative NLG.

The NLP and CC fields conduct work from very different starting points (Purver et al., 2016). NLP is often state-of-the-art driven, whereas CC presents more exploratory research without pursuing scores that outperform a baseline. In this paper, we want to study how human evaluation of creative NLG systems is conducted in the world of NLP and in the world of CC, what similarities there are, and whether the two fields can learn something from each other.

We base our research on a literature review of the papers dealing with human-evaluated creative NLG published in the 2020 editions of the International Conference on Computational Creativity (ICCC) and the International Conference on Natural Language Generation (INLG). We picked these conferences as ICCC is the most important venue for CC research, and INLG the most important NLP-focused venue for NLG research.

Our results show that there is currently no consensus on how evaluation should be conducted, despite the many different efforts to establish guidelines for evaluating computationally creative output (Pease and Colton, 2011; Jordanous, 2012; Lamb et al., 2018; Hämäläinen, 2020). We reflect on the results of our survey and propose a road-map for more sound future evaluation practices.

2 Surveying human evaluation methods

In this section, we go through how human evaluation was conducted in the papers we selected for the survey. From the ICCC proceedings, we included all papers that dealt with NLG and had a human evaluation. We did not survey papers that presented work on generating something other than language, such as music. From the INLG proceedings, we picked all papers that presented work on an open-ended NLG problem whose output could exhibit some creativity, ruling out papers that dealt with purely factual data-to-text generation tasks.

Paper | NLG task | Evaluated parameters | How questions were motivated | Evaluation type
----- | -------- | -------------------- | ---------------------------- | ---------------
1. Mathewson et al. 2020 | Collaborative dialogue | engagement | Engagement measured the notions of revealing and concealing | Ranking models
2. Cheatley et al. 2020 | Song writing tool | support of self-expression, therapeutic value and receptiveness to the tool and songs created | Not discussed | User study (qualitative)
3. Mirowski et al. 2020 | Auxiliary tool for improv theater | based on critics' previews and reviews | No questions | Public performance
4. Spendlove and Ventura 2020 | Generating six word stories | coherence, impactfulness | Not discussed | 5-point scale
5. Ammanabrolu et al. 2020 | Quest generation in text adventure games | coherence, originality (novelty), sense of accomplishment (value), unpredictability (surprise) | By Boden's theory on creativity | 7-point scale
6. Mendes and Oliveira 2020b | Headline-proverb pair generation | relatedness, funniness | Not discussed | 4-point scale
7. Tyler et al. 2020 | Pun generation | funniness, surprise, cleverness, did the user laugh, wit, ingenuity, timelessness, and accessibility | Not discussed | 5-point scale
8. Mendes and Oliveira 2020c | Contextual headline adaptation | syntax, relatedness, funniness | Not discussed | 3-point scale
9. Hämäläinen et al. 2020 evaluation 1 | Dialectal adaptation of generated poetry | poem (yes/no), typicality, understandability, quality of language, evoked imagery, evoked emotions, annotator's liking | Previous research | 5-point scale
Hämäläinen et al. 2020 evaluation 2 | Dialectal adaptation of generated poetry | emotivity, originality, creativity, poem-likeness, artificiality, fluency | Not discussed | Association
10. Savery et al. 2020 | Real time human-machine rap battles | annotator's perception, coherence, rhythm, rhyme, quality, enjoyment, relation between the hip hop and metal dataset, and relationship between input and output | By research questions | Open ended questions + automatic analysis, preference
11. Oliveira 2020 | Song lyric transformation | familiarity, novelty, grammaticality, semantics, singability, overall appreciation and topicality | Not discussed | 5-point scale and picking the most suitable topic
12. Shihadeh and Ackerman 2020 | Emily Dickinson style poem generation | typicality, understandability, quality of language, evoked imagery, evoked emotions, annotator's liking | Previous research | 5-point scale
13. Gong et al. 2020 | Text style transfer | content preservation, transfer strength and fluency | Automated evaluation | Picking the best
14. Obeid and Hoque 2020 | Text generation from charts | informativeness, conciseness, coherence, fluency, factuality | Not discussed | 5-point scale and yes/no/partially/can't decide for factuality
15. Lee 2020 | Style transform | content, fluency, and style | Not discussed | 5-point scale
16. Mendes and Oliveira 2020a | Enhancing headlines with creative expressions | relatedness, funniness | Not discussed | 4-point scale
17. Langner 2020 | Referring expression generation in a virtual environment | comprehension based on identification time, error rate and repetition counts | Not discussed | User study based on quantitative values
18. Scialom et al. 2020 | Question generation from images | readability, caption relevance and image relevance | Not discussed | 5-point scale
19. Ilinykh and Dobnik 2020 | Multi-sentence image description generation | word choice, object salience, sentence structure and paragraph coherence | Not discussed | Slider
20. Akermi et al. 2020 | Question answering | relevance, errors | Not discussed | Relevance (correct/not correct), error type checkboxes, open ended comment field
21. Nikolov et al. 2020 evaluation 1 | Rap lyric generation | style, meaning, familiarity | Not discussed | 5-point scale
Nikolov et al. 2020 evaluation 2 | Rap lyric generation | Turing test | Not discussed | Picking which out of 2 is written by a human
Nikolov et al. 2020 evaluation 3 | Rap lyric generation | Turing test | Not discussed | Human written (yes/no)
22. Wang et al. 2020 | Paper review generation | constructiveness and validity | Not discussed | Not stated
23. Hedayatnia et al. 2020 | Response generation in a dialog system | appropriateness | Previous research | Picking the best

                    Table 1: Evaluated parameters, their motivation and evaluation type in the surveyed papers

In ICCC 2020, there were 12 papers that presented human-evaluated work on creative NLG, and in INLG 2020, there were 11 such papers. We selected these papers for our survey. Fortunately, both venues had roughly the same number of papers.

When surveying the papers, we only focused on human evaluation, and we wanted to know what the NLG task was, what parameters were being evaluated (usually reflected by the evaluation questions), how these parameters (questions) were motivated and how the actual evaluation was conducted methodologically. We also paid attention to the evaluation setup: the number of evaluators and samples used and whether the evaluators were experts or laymen. Finally, we looked into the discussions and conclusions presented in the papers to see what role the human evaluation had there, especially in relation to concrete future directions in improving the system based on the evaluation results.

2.1 What is evaluated?

Table 1 shows the results of our survey in terms of what parameters were evaluated and how the evaluation was conducted. Papers 1-12 were published in ICCC and represent the CC field, whereas papers 13-23 were published in INLG, representing the NLP side of the same coin.

When looking at the results, we can immediately see that there is quite a range of different NLG tasks. Even for papers that deal with very similar
tasks, such as papers 2, 10, 11 and 21, the framing of the problem is very different, ranging from lyric transformation to full-blown human versus computer rap battles. The evaluated parameters were also very different.

Despite the parameters being very different from each other, several papers evaluated meaning in one way or another: for example, papers 4, 5, 10, 14 and 19 evaluated coherence, paper 11 semantics and paper 21 meaning. Papers 9 and 12 evaluated understandability, which is not directly the same as meaning.

Syntactic correctness of the language was also one of the commonly evaluated features. Papers 9, 13 and 14 measure fluency, paper 11 grammaticality, papers 9 and 12 quality of language and paper 8 syntax. In addition, paper 18 evaluated readability, which is partially related to correctness and partially to meaning.

One of the parameters that was evaluated through multiple synonyms and even antonyms was novelty. Papers 5 and 9 evaluated originality, paper 11 novelty, paper 7 surprise and paper 5 unpredictability. Papers 9 and 12 evaluated the opposite of novelty, which is typicality.

Relevance was also commonly evaluated (papers 18 and 20). The parameter was evaluated as relatedness in papers 6, 8 and 16, although all of these are by the same authors.

Many papers also evaluated emotional value, such as paper 9 through emotivity, paper 10 through enjoyment, paper 1 through engagement, papers 9 and 12 through evoked emotions and papers 6, 7, 8 and 16 through funniness, although three of these papers were by the same authors.

2.2 Why are the evaluation parameters chosen?

The aforementioned parameters do not cover all the parameters that were used in evaluation; however, they were the most typical ones. When we look into how the evaluation parameters were selected, we can notice that most of the papers do not present any reasoning as to why these are the relevant attributes to look at.

The few papers that did present a reasoning had many different reasons for the evaluated parameters. Paper 1 motivates its evaluated parameter by stating that it measures the notions of revealing and concealing, which were deemed important for the task. Paper 3 did not have any parameters at all for evaluation. Paper 5 motivated the evaluated parameters through an existing theory on computational creativity (Boden, 2007). Paper 10 formulated the evaluated parameters based on the research questions established in the paper. Paper 13 formulated the evaluated parameters so that they would measure the same things as their automated evaluation.

Papers 9 and 12 used evaluation questions originally established by Toivanen et al. (2012). While basing evaluation on existing research makes the evaluation questions sound better motivated, the original paper where these evaluation questions were first established did not present any reasoning as to why these should be the evaluation questions to be used with generated poetry. Paper 23 also stated that they used "a similar setup" to the one proposed by Li et al. (2016). In practice this meant that, whereas the original paper proposed three different evaluation setups, paper 23 only presented one of them. The reasoning for this evaluation was not discussed in the original paper.

2.3 How is the evaluation conducted?

Most of the papers present only one human evaluation method. The exceptions are paper 9, which presents two distinct evaluation setups, and paper 21, which presents three distinct evaluation setups.

The most common way of conducting a human evaluation is to use a questionnaire that is rated on a numerical scale. Papers 4, 7, 9 (evaluation 1), 11, 12, 14, 15, 18 and 21 (evaluation 1) used a 5-point scale. Papers 6 and 16, written by the same authors, use a 4-point scale, and paper 8, also by the same authors, uses a 3-point scale. Paper 5 uses a scale as large as 7 points. The most unusual of the papers using a numeric scale is paper 19, which presents a continuous slider that the annotators can move freely. Some of the papers use a different scale for one of the questions.

The second most typical evaluation method is based on preference. Here the outputs are preferred or ranked in relation to each other. Paper 1 presents a ranking method where different models are ranked based on which one is the best. Paper 9 (evaluation 2) presents two poems side by side and asks annotators to associate the presented parameters with either one of them. Paper 10 uses preference of output as one of the evaluation criteria. Papers 13 and 23 ask the annotators to pick the best output candidate. Paper 21 (evaluation 2)
asks the annotators to guess which output is human written and which AI written. Paper 11 asks the annotators to rank the most suitable topic; this is slightly different, as here the annotators are not asked to rank the output per se. As we can see, there is a great number of variations in how this type of evaluation is conducted. As opposed to the most popular evaluation method, these methods only give relative results. This means that even if all of the outputs were bad, one of them would still be picked as the best.

Two papers, 2 and 17, present a user study. Paper 2 conducts this in a qualitative way with open-ended questions where the discussion is directed towards the parameters that the authors wanted to measure. The discussions with the participants are not fully reported in the paper; instead, the authors present some quotes relating to the parameters under study in a non-rigorous fashion. Paper 17 presents a quantitative user study where the results are analyzed based on different values, such as execution time, that were gathered during the study.

Paper 3 presents something completely unique in terms of evaluation. The authors organize live improv theater sessions with the system and base the results on the reviews and previews by critics. However, these were not discussed in the paper in detail; rather, some cherry-picked quotations were reported.

Paper 10 was another paper to conduct a qualitative evaluation. The annotators were asked to answer open-ended questions, and their input was then automatically processed to reach conclusions. An open-ended comment field was also provided in paper 20; however, that paper focused on discussing the results of the two other questions in the questionnaire. Its annotators were asked to give a binary rating on whether the output was relevant or not; similarly, paper 9 (evaluation 1) presented one binary question about poeticity and paper 21 (evaluation 3) presented a binary question on whether the output was human authored. In addition, paper 20 asked the annotators to indicate which types of errors the output had by providing a set of check-boxes with predefined error types.

Unlike the rest of the papers, paper 22 did not explain how the evaluation was conducted in any detail. The results were percentages, which indicates that the evaluation might have been based on binary questions.

2.4 Sample sizes and annotators

Table 2 shows the number of annotators and sample sizes used in the different papers. We have tried our best to collect this information from the papers; however, these parameters were not always expressed clearly. The worst example is paper 3, which stated that they got multiple reviews, previews and feedback from the audience and the actors without specifying the exact number.

Most of the papers relied on non-expert annotators for conducting the evaluation, with the exception of papers 1, 21 and 22, and partially paper 3. The use of experts is understandable, as not just about anyone is competent enough to tell whether, for example, generated reviews for scientific papers (as in paper 22) are good or bad. However, this leads to a small number of evaluators, as experts are difficult to recruit. Papers that did not use experts to evaluate the output either did not report any special requirements or mostly ensured that the evaluators were proficient enough in the language of the output.

In terms of the sample size, that is, how many generated artefacts were evaluated, the numbers vary a lot, from as few as 2 (paper 5) up to 250 (papers 15 and 19). The samples were mostly picked at random; however, some papers, like paper 7, evaluated manually picked output.

There was also a lot of divergence in the number of annotators. Some papers had all annotators go through all samples, like papers 21 and 22 did, while other papers had several annotators annotate the outputs so that each individual output was evaluated by at least 3 annotators, as in papers 14 and 23. Usually, there was no clear discussion on how many outputs a given annotator annotated, with the exception of paper 19, which reported that a given annotator could only annotate up to 30 outputs.

2.5 Evaluation results

An interesting point we wanted to pay attention to was the use of the evaluation results. After conducting a costly and time-consuming human evaluation, one would hope that the results give a direction to future research. However, this was not the case. All papers were limited to writing out the evaluation results and stating which system was better if the paper evaluated multiple systems. None of the papers was able to identify any concrete future directions for improving the generative system based on the human evaluation results.

Paper | Experts | Number of annotators | Number of samples
----- | ------- | -------------------- | -----------------
1. Mathewson et al. 2020 | yes | 4 | 3 conversations (5 utterance-response pairs in each)
2. Cheatley et al. 2020 | no | 3 | Free engagement with the system
3. Mirowski et al. 2020 | yes (reviews), no (audience) | multiple | Performance
4. Spendlove and Ventura 2020 | no | 14 per story | 15 stories
5. Ammanabrolu et al. 2020 | no | 15 for each game | 2 room layouts
6. Mendes and Oliveira 2020b | no | 4 per headline | 60 headlines
7. Tyler et al. 2020 | no | 10 in total | 10 best manually selected puns
8. Mendes and Oliveira 2020c | no | 2 in total | 30 headlines
9. Hämäläinen et al. 2020 evaluation 1 | no | 5 per poem variant | 10 poems
Hämäläinen et al. 2020 evaluation 2 | no | 5 per dialectal-standard Finnish poem pair | 10 parallel poems
10. Savery et al. 2020 | no | 33 | 1 video clip, hand picked best output, 10 additional video clips and 10 generated tasks
11. Oliveira 2020 | no | 3 per lyric | 120 lyrics
12. Shihadeh and Ackerman 2020 | no | 17 in total | 10 generated + 2 Emily Dickinson poems
13. Gong et al. 2020 | no | 2 in total | outputs for 100 inputs
14. Obeid and Hoque 2020 | no | 3 per statistic | output for 40 charts
15. Lee 2020 | no | 6 people per sample | 250 samples
16. Mendes and Oliveira 2020a | no | 4 per headline | 60 headlines
17. Langner 2020 | no | 34 participants | 10 fixed sessions
18. Scialom et al. 2020 | no | 3 in total | 50 images
19. Ilinykh and Dobnik 2020 | no | 154 in total (a participant could rate at most 30 images) | 250 images
20. Akermi et al. 2020 | no | 20 in total | 150 questions
21. Nikolov et al. 2020 evaluation 1 | yes | 3 in total | 100 verses
Nikolov et al. 2020 evaluation 2 | yes | 3 in total | 100 verses
Nikolov et al. 2020 evaluation 3 | yes | 3 in total | 100 verses
22. Wang et al. 2020 | yes | 2 in total | 50 papers
23. Hedayatnia et al. 2020 | no | 3 per snippet | 200 snippets of 5-turn dialog

                                    Table 2: Evaluators and samples in the surveyed papers

Human evaluation was merely there to provide some convincing evidence of the quality of the systems.

The only exception to this was paper 9. The authors conducted two different evaluations and reached an insightful conclusion: the two evaluation methods contradicted each other. According to the first evaluation, standard Finnish was preferred over the dialectal variant on all parameters. However, the second evaluation showed that a dialectal poem was more often associated with originality, creativity and poem-likeness than its standard Finnish variant. The authors note that the results depend not only on how you conduct your human evaluation, but also on familiarity bias. In the first evaluation, where dialect was a controlled variable, the further the dialect was from standard Finnish, the lower it scored, as the annotators were less familiar with it.

3 Discussion

There are currently many different creative NLG tasks people work with, and it is understandable that each task calls for slightly different evaluation methods. However, even work on closely related topics prefers to use its own evaluation methods that are not based on any existing research. Most alarmingly, even when the evaluation is based on existing research, the evaluation questions are not motivated in that earlier research either. This type of evaluation has come to be known as a symptom of the Great Misalignment Problem (Hämäläinen and Alnajjar, 2021). When the evaluation is not targeted towards evaluating exactly what has been modelled, any type of evaluation that seems remotely related to the task becomes seemingly valid.

However, when the evaluated parameters have only little to do with what was modelled, it is hardly surprising that none of the surveyed papers was able to identify the shortcomings of their systems clearly enough to propose clear paths for future research to follow. In fact, if you evaluate your system based on relatedness and funniness while neither is explicitly modelled, how can you know how to make your system funnier or produce more related output? The scores might well have been achieved by mere serendipity (the annotators happened to like the humor that happened to be in the small sample) (cf. Gervás 2017) or by the data the model was trained on.

Apart from the evaluation questions not aligning with the model, a much larger problem related to evaluation questions can be identified. Firstly, most of the papers were not clear about the actual evaluation questions used; instead, they listed the evaluated parameters as though human evaluation were like an automated one where one can just score abstract notions such as typicality or fluency accurately on a 5-point scale. In other fields, it is known that even small changes in survey questions can lead to different survey results (Kalton and Schuman, 1982; de Bruin et al., 2011, 2012). Not revealing the actual questions only makes the situation worse. Another problem that arises from abstract evaluation questions is that it becomes less clear why the annotators gave certain answers.

Furthermore, people have a tendency to read more into computer-generated output than what the system intended (Veale, 2016). If you train a generative neural model on jokes, it will surely learn to output jokes, while it does not necessarily have any internal representation of humor. In such a case, the humor is purely in the eyes of the beholder and in the data the model was trained on, not in the method itself [1]. For instance, Alnajjar et al. (2019) have shown that generated headlines were perceived as more offensive by human annotators, while offensiveness was never modelled in the system.

[1] See Colton (2008) for discussion on the roles of the programmer, program and perceiver in creative systems.

While almost every paper we surveyed opts for coming up with its own evaluation metrics, it is astonishing that these newly created evaluation settings are used as such. Other fields dealing with human surveys emphasize the need to test your survey before conducting it on a larger scale, in order to discover potential issues in your questionnaire (Collins, 2003; Presser et al., 2004; Thomas, 2004). None of the papers we surveyed discusses the evaluation of the evaluation itself. Instead, it is believed that any new evaluation metric the authors came up with just for a given paper will magically work as such and will yield scientifically valid results that will pass peer review. All this while many of the papers ask questions using ambiguous terms such as fluency (is something grammatical fluent? is something that seems to make sense semantically fluent? is something that is close to the annotator's own idiolect more fluent than something further away from it? is text generated in American English more fluent to Americans than text generated in British English? and so on) and coherence (is something that repeats the same words coherent? can a complex figure of language be coherent if the annotator does not have time to think about it for more than a couple of seconds? does coherence have something to do with grammaticality as well? is a story that follows the same beliefs as the annotator seen as more coherent? and so on) that are reduced to a compact 1-5 scale that is later neatly averaged over all the annotators' opinions on all the samples. What does an average of 3.5 on a question that all annotators might have interpreted differently even mean?
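
To make this concrete, here is a small illustrative sketch (the numbers are invented, not taken from any of the surveyed papers) showing how two very different sets of 1-5 ratings collapse to the same mean of 3.5:

```python
# Hypothetical ratings from ten annotators for the same output.
polarised = [1, 1, 1, 2, 5, 5, 5, 5, 5, 5]   # annotators strongly disagree
lukewarm  = [3, 3, 4, 4, 3, 4, 3, 4, 3, 4]   # annotators mildly agree

for name, ratings in [("polarised", polarised), ("lukewarm", lukewarm)]:
    mean = sum(ratings) / len(ratings)
    variance = sum((r - mean) ** 2 for r in ratings) / len(ratings)
    print(name, "mean:", round(mean, 2), "variance:", round(variance, 2))

# Both lists average to 3.5, yet they tell completely different stories
# about how the annotators interpreted the question.
```

Reporting at least the full distribution, or the variance alongside the mean, would already expose this kind of disagreement.
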
In other fields conducting online surveys, there are a lot of worries about selection bias of the human subjects (Bethlehem, 2010; Greenacre, 2016). This is hardly discussed in the fields of NLP and CC. Many of the papers we surveyed conducted their evaluation on a crowd-sourcing platform such as Amazon Mechanical Turk. None of the papers presented statistics on the demographics of the annotators. This might be a source of bias in the results. What makes such a bias even more problematic is the relatively small number of annotators that are usually recruited per individual output. Fields with more established human survey practices would not consider the typical 3-5 annotators of NLP and CC enough even for a qualitative survey, which requires 5-25 participants (Creswell, 1998) or at least 6 participants (Morse, 1994). However, human evaluation is usually conducted quantitatively, which means that the number of annotators depends heavily on multiple parameters and requires planning and justification in its own right (Bell, 1991; Lenth, 2001; Lavrakas, 2008).

It is also very well known that people do not perceive things in a vacuum but rather as a continuum of stimuli where previously perceived stimuli affect the next one. This effect is called priming (see Henson 2009).
To reduce the effect of priming, or at least to keep it consistent, one should either shuffle the order in which the output is presented to the annotators or keep it always the same. Priming is especially in play in cases where annotators are to evaluate outputs produced by different systems. In such a case, the output of a mediocre system might get a considerable boost when presented together with output from a bad system. Nearly none of the papers we surveyed discussed this aspect of their evaluation setting.
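
As one possible way to handle this, the sketch below (with made-up system names and outputs) shuffles the presentation order separately for every annotator, while seeding the shuffle with the annotator id so that the exact order shown can still be reported afterwards:

```python
import random

# Hypothetical outputs from three systems for the same five inputs.
outputs = {
    "system_a": ["a1", "a2", "a3", "a4", "a5"],
    "system_b": ["b1", "b2", "b3", "b4", "b5"],
    "system_c": ["c1", "c2", "c3", "c4", "c5"],
}

def build_questionnaire(outputs, annotator_id):
    """Return all items in a per-annotator random order."""
    items = [(system, text) for system, texts in outputs.items() for text in texts]
    # Seeding with the annotator id makes the shown order reproducible and reportable.
    rng = random.Random(annotator_id)
    rng.shuffle(items)
    return items

for annotator_id in range(3):
    order = [system for system, _ in build_questionnaire(outputs, annotator_id)]
    print(annotator_id, order)
```

This does not remove priming, but it spreads its effect evenly over the systems instead of systematically favouring whichever system happens to follow a weak one.
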
Both CC and NLP still have a long way to go to reach more sound human evaluation practices. However, INLG is a step closer to scientific rigor, as automated evaluation metrics were commonly used together with human evaluation, and sometimes as the only evaluation metric (such as Bień et al. 2020), whereas ICCC had several papers presenting work on creative NLG without any evaluation at all (such as Agafonova et al. 2020; Petac et al. 2020; Wright and Purver 2020).

The use of experts in evaluation is something that should come under rigorous inspection in the future. Currently, there are contradictory studies on the topic, indicating that consulting experts does have an effect in machine translation (Toral et al., 2018) but not in poem generation (Lamb et al., 2017). However, this is a question that is very likely to depend on the output that is to be evaluated and also on how the evaluation is conducted.

Human-computer interaction research has some more established methodologies for conducting human studies (see Jacko 2012; Lazar et al. 2017), such as cognitive walk-throughs (see Mahatody et al. 2010), human performance evaluation support systems (Ha et al., 2007) and user studies (see MacKenzie 2015). These established methodologies could be taken into account when evaluating an NLG system that calls for user interaction.

4 Advice for future evaluation

In this section, we outline how human evaluation of creative NLG systems should be conducted. We are not going to give an exact silver-bullet framework to solve the problem, as the two fields are not yet at a state where enough is known about human evaluation to state exactly how it needs to be conducted. Furthermore, we do not believe that a single fixed framework is enough to capture everything necessary in a topic as broad as creative text generation.

4.1 Define the goals

From very early on, it is important to define what the goals of your system are (see Alnajjar and Hämäläinen 2018; Jordanous 2012). Try to be as concrete and precise as possible at this step. Once you have your goals clearly stated, it is easy to see the degree to which your implementation tries to achieve those goals, and how much can be attributed to the method and how much to the training data. After this, the evaluation parameters will follow naturally from the goals you set for your system. This way, the evaluation questions do not appear out of nowhere but are motivated by your research goals and implementation.

4.2 Go concrete

People have an inbuilt need to understand anything expressed in their language (see Veale 2016). This can easily lead to a situation where annotators read more into the evaluated output than what your system was aware of. By using evaluation questions that are as concrete as possible, you can reduce the room for subjective interpretation (see Hämäläinen and Alnajjar 2019). For example, for a pun like "Becoming a vegetarian is a big missed steak", asking the annotators "Is this humorous?" and "Is this humorous because the pun 'missed steak' sounds like 'mistake'?" will result in different possible interpretations, as the former question might let the annotators consider the generated joke funny for reasons other than those intended by the generative system.

4.3 Run some tests

As we have seen in this paper, the same concept can be evaluated through multiple different wordings, and it is not always clear that the annotators understand the questions in the same way as the researchers intended. By running tests on your survey in real life, you can get more direct feedback than what you could get from annotators on Amazon Mechanical Turk. It is better to adjust your evaluation questions sooner rather than after running a costly crowd-sourcing campaign.

Furthermore, the final number of annotators you need and how many samples you should evaluate depend on the evaluation task and setting. If you get high diversity in answers in the test run, you will probably need a larger number of annotators conducting the actual evaluation.

Testing is also a great way of seeing whether you are asking non-experts to evaluate things they consider too difficult, or whether your questionnaire is too lengthy. You do not want your annotators to lose interest in the middle of the questionnaire and start annotating quickly without paying much attention.
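
One possible way to act on such a pilot, sketched below with invented numbers and under the simplifying assumption that the ratings can be treated as interval data, is to estimate the rating variability per question and derive a rough number of ratings needed to pin down the mean within a chosen margin of error:

```python
import math
import statistics

# Hypothetical pilot ratings (1-5) per evaluation question.
pilot = {
    "fluency": [4, 4, 5, 4, 5, 4],   # low diversity in answers
    "humour":  [1, 5, 2, 5, 4, 1],   # high diversity in answers
}

def ratings_needed(ratings, margin=0.5, z=1.96):
    """Rough number of ratings needed to estimate the mean within
    +/- margin at ~95% confidence, based on the pilot's spread."""
    sd = statistics.stdev(ratings)
    return math.ceil((z * sd / margin) ** 2)

for question, ratings in pilot.items():
    print(question, "pilot sd:", round(statistics.stdev(ratings), 2),
          "ratings needed per item:", ratings_needed(ratings))
```

This is only a back-of-the-envelope estimate; the works cited earlier (Bell, 1991; Lenth, 2001; Lavrakas, 2008) discuss sample-size planning in far more depth.
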
4.4 Run multiple evaluations

Human evaluation does not need to be a one-time effort conducted in one massive survey. You can run multiple different evaluations, such as preference-based ones, 5-point-scale ones and true/false statements, to better understand the limitations of your system and of your human evaluation. The more evidence you can show from different evaluation methods, the more confident you and other researchers can be about the quality of your method.

4.5 Report everything clearly

It is important to report the evaluation questions exactly as they were used, how the survey form was constructed, including any instructions and the wording used for the 5-point scale, and how the output was presented (always in the same order or shuffled). All of these have an effect on the results. In software engineering, it is considered important to report any threats to the validity of the research (Feldt and Magazinius, 2010). The same should apply to NLP and CC. One of the important threats to the validity of human evaluation is bias in the results. Therefore, it is important to report and discuss what kind of people participated in the evaluation survey.

4.6 Analyze your results

It is also important to dig deeper into the human evaluation results. If you as a researcher put a considerable amount of money into getting your human evaluation results, you should make the most of them too. Instead of merely reporting the typical statistics (mean, mode, median, standard deviation), why not also look into the best and worst performing outputs of the system and let the human evaluation guide a deeper error analysis? This can open up insightful directions for future research.
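
As a minimal sketch of what this can look like in practice (the output ids, questions and ratings below are invented), aggregate the ratings per output and pull out the worst and best rated items for a closer qualitative reading:

```python
from collections import defaultdict

# Hypothetical evaluation records: (output_id, question, rating on a 1-5 scale).
records = [
    ("out_01", "fluency", 5), ("out_01", "humour", 2),
    ("out_02", "fluency", 2), ("out_02", "humour", 1),
    ("out_03", "fluency", 4), ("out_03", "humour", 5),
]

# Average rating per (output, question) pair.
ratings = defaultdict(list)
for output_id, question, rating in records:
    ratings[(output_id, question)].append(rating)
means = {key: sum(values) / len(values) for key, values in ratings.items()}

# For each question, list the worst and best rated outputs; these are the
# concrete items worth reading closely in an error analysis.
for question in {q for _, q in means}:
    ranked = sorted((mean, output) for (output, q), mean in means.items() if q == question)
    print(question, "worst:", ranked[0], "best:", ranked[-1])
```

Reading the actual texts behind the lowest-rated ids, question by question, is what turns the numbers into concrete directions for improving the system.
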
                                                             to be expected in other fields of science. However,
5     Conclusions                                            this is not to say that the work of the authors of
                                                             the papers we surveyed is inherently bad. This is
In this paper, we have surveyed papers presenting            just to highlight the fact that more attention needs
work on creative natural language generation that            to be paid in how human evaluation is conducted.
have been published in INLG 2020 and ICCC 2020.              Quite often with creative text generation, human

                                                        91
judgment is the only viable metric to measure the                 tion. In Proceedings of the 13th International Con-
performance of a system. Human evaluation of                      ference on Natural Language Generation, pages 22–
                                                                  28, Dublin, Ireland. Association for Computational
generated text has been conducted in the field of
                                                                  Linguistics.
NLP already as early as in the 1960s (McDaniel
et al., 1967) it is a pity it has not caught up with the        Margaret A Boden. 2007.      Creativity in a nutshell.
rest of the development in the field.                            Think, 5(15):83–96.

References

Yana Agafonova, Alexey Tikhonov, and Ivan P. Yamshchikov. 2020. Paranoid transformer: Reading narrative of madness as computational approach to creativity. In Eleventh International Conference on Computational Creativity: ICCC’20. Association for Computational Creativity.

Imen Akermi, Johannes Heinecke, and Frédéric Herledan. 2020. Tansformer based natural language generation for question-answering. In Proceedings of the 13th International Conference on Natural Language Generation, pages 349–359, Dublin, Ireland. Association for Computational Linguistics.

Khalid Alnajjar and Mika Hämäläinen. 2018. A master-apprentice approach to automatic creation of culturally satirical movie titles. In Proceedings of the 11th International Conference on Natural Language Generation, pages 274–283.

Khalid Alnajjar, Leo Leppänen, and Hannu Toivonen. 2019. No time like the present: Methods for generating colourful and factual multilingual news headlines. In Proceedings of the 10th International Conference on Computational Creativity. Association for Computational Creativity.

Prithviraj Ammanabrolu, William Broniec, Alex Mueller, Jeremy Paul, and Mark Riedl. 2020. Toward automated quest generation in text-adventure games. In Proceedings of the 11th International Conference on Computational Creativity.

John F. Bell. 1991. Big is not necessarily beautiful in survey design: Measurement error and the APU science survey. Journal of the Royal Statistical Society: Series D (The Statistician), 40(3):291–300.

Anya Belz, Simon Mille, and David M. Howcroft. 2020. Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing. In Proceedings of the 13th International Conference on Natural Language Generation, pages 183–194, Dublin, Ireland. Association for Computational Linguistics.

Jelke Bethlehem. 2010. Selection bias in web surveys. International Statistical Review, 78(2):161–188.

Michał Bień, Michał Gilski, Martyna Maciejewska, Wojciech Taisner, Dawid Wisniewski, and Agnieszka Lawrynowicz. 2020. RecipeNLG: A cooking recipes dataset for semi-structured text generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 22–28, Dublin, Ireland. Association for Computational Linguistics.

Margaret A. Boden. 2007. Creativity in a nutshell. Think, 5(15):83–96.

Wändi Bruine de Bruin, Martine Baldassi, Bernd Figner, Baruch Fischhoff, Lauren Fleishman, David Hardisty, Eric Johnson, Gideon Keren, Maria Konnikova, and Irwin Levin. 2011. Framing effects in surveys: How respondents make sense of the questions we ask. In Perspectives on Framing, edited by Gideon Keren, pages 303–325. New York, NY. Taylor and Francis Group, Psychology Press.

Wändi Bruine de Bruin, Wilbert Van der Klaauw, Giorgio Topa, Julie S. Downs, Baruch Fischhoff, and Olivier Armantier. 2012. The effect of question wording on consumers’ reported inflation expectations. Journal of Economic Psychology, 33(4):749–757.

Lee Cheatley, Margareta Ackerman, Alison Pease, and Wendy Moncur. 2020. Co-creative songwriting for bereavement support. In Eleventh International Conference on Computational Creativity: ICCC’20, pages 33–41. Association for Computational Creativity.

Debbie Collins. 2003. Pretesting survey instruments: An overview of cognitive methods. Quality of Life Research, 12(3):229–238.

Simon Colton. 2008. Creativity versus the perception of creativity in computational systems. In AAAI Spring Symposium: Creative Intelligent Systems, volume 8.

John W. Creswell. 1998. Qualitative inquiry and research design: Choosing among five traditions. Sage Publications.

Robert Feldt and Ana Magazinius. 2010. Validity threats in empirical software engineering research - an initial survey. In SEKE, pages 374–379.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.
Pablo Gervás. 2017. Template-free construction of rhyming poems with thematic cohesion. In Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017), pages 21–28, Santiago de Compostela, Spain. Association for Computational Linguistics.

Hongyu Gong, Linfeng Song, and Suma Bhat. 2020. Rich syntactic and semantic information helps unsupervised text style transfer. In Proceedings of the 13th International Conference on Natural Language Generation, pages 113–119, Dublin, Ireland. Association for Computational Linguistics.

Zerrin Asan Greenacre. 2016. The importance of selection bias in internet surveys. Open Journal of Statistics, 6(03):397.

Jun Su Ha, Poong Hyun Seong, Myeong Soo Lee, and Jin Hyuk Hong. 2007. Development of human performance measures for human factors validation in the advanced MCR of APR-1400. IEEE Transactions on Nuclear Science, 54(6):2687–2700.

Mika Hämäläinen. 2020. Generating Creative Language - Theories, Practice and Evaluation. University of Helsinki.

Mika Hämäläinen and Khalid Alnajjar. 2019. Let’s face it. Finnish poetry generation with aesthetics and framing. In Proceedings of the 12th International Conference on Natural Language Generation, pages 290–300.

Mika Hämäläinen and Khalid Alnajjar. 2021. The great misalignment problem in human evaluation of NLP methods. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 69–74, Online. Association for Computational Linguistics.

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar, Jack Rueter, and Thierry Poibeau. 2020. Automatic dialect adaptation in Finnish and its effect on perceived creativity. In 11th International Conference on Computational Creativity (ICCC’20). Association for Computational Creativity.

Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric, and Dilek Hakkani-Tur. 2020. Policy-driven neural response generation for knowledge-grounded dialog systems. In Proceedings of the 13th International Conference on Natural Language Generation, pages 412–421, Dublin, Ireland. Association for Computational Linguistics.

Rik Henson. 2009. Priming. In Encyclopedia of Neuroscience, volume 7. Academic Press.

David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, and Verena Rieser. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 169–182, Dublin, Ireland. Association for Computational Linguistics.

Nikolai Ilinykh and Simon Dobnik. 2020. When an image tells a story: The role of visual and semantic information for generating paragraph descriptions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 338–348, Dublin, Ireland. Association for Computational Linguistics.

Julie A. Jacko. 2012. Human computer interaction handbook: Fundamentals, evolving technologies, and emerging applications. CRC Press.

Anna Jordanous. 2012. A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative. Cognitive Computation, 4(3):246–279.

Graham Kalton and Howard Schuman. 1982. The effect of the question on survey responses: A review. Journal of the Royal Statistical Society: Series A (General), 145(1):42–57.

Carolyn Lamb, Daniel G. Brown, and Charles L. A. Clarke. 2017. Incorporating novelty, meaning, reaction and craft into computational poetry: A negative experimental result. In ICCC, pages 183–188.

Carolyn Lamb, Daniel G. Brown, and Charles L. A. Clarke. 2018. Evaluating computational creativity: An interdisciplinary tutorial. ACM Computing Surveys (CSUR), 51(2):1–34.

Maurice Langner. 2020. OMEGA: A probabilistic approach to referring expression generation in a virtual environment. In Proceedings of the 13th International Conference on Natural Language Generation, pages 296–305, Dublin, Ireland. Association for Computational Linguistics.

Paul J. Lavrakas. 2008. Sample size. In Encyclopedia of survey research methods. Sage Publications.

Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research methods in human-computer interaction. Morgan Kaufmann.

Joosung Lee. 2020. Stable style transformer: Delete and generate approach with encoder-decoder for text style transfer. In Proceedings of the 13th International Conference on Natural Language Generation, pages 195–204, Dublin, Ireland. Association for Computational Linguistics.
Russell V. Lenth. 2001. Some practical guidelines for effective sample size determination. The American Statistician, 55(3):187–193.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, et al. 2020. GLGE: A new general language generation evaluation benchmark. arXiv preprint arXiv:2011.11928.

I. Scott MacKenzie. 2015. User studies and usability evaluations: From research to products. In Graphics Interface, pages 1–8.

Thomas Mahatody, Mouldi Sagar, and Christophe Kolski. 2010. State of the art on the cognitive walkthrough method, its variants and evolutions. Intl. Journal of Human-Computer Interaction, 26(8):741–785.

Kory Mathewson, Pablo Samuel Castro, Colin Cherry, George Foster, and Marc G. Bellemare. 2020. Shaping the narrative arc: Information-theoretic collaborative dialogue.

J. McDaniel, W. L. Price, A. J. M. Szanser, and D. M. Yates. 1967. An evaluation of the usefulness of machine translations produced at the National Physical Laboratory, Teddington, with a summary of the translation methods. In COLING 1967 Volume 1: Conference Internationale Sur Le Traitement Automatique Des Langues.

Rui Mendes and Hugo Gonçalo Oliveira. 2020a. Amplifying the range of news stories with creativity: Methods and their evaluation, in Portuguese. In Proceedings of the 13th International Conference on Natural Language Generation, pages 252–262, Dublin, Ireland. Association for Computational Linguistics.

Rui Mendes and Hugo Gonçalo Oliveira. 2020b. Comparing different methods for assigning Portuguese proverbs to news headlines. In Eleventh International Conference on Computational Creativity: ICCC’20.

Rui Mendes and Hugo Gonçalo Oliveira. 2020c. Teco: Exploring word embeddings for text adaptation to a given context. In Eleventh International Conference on Computational Creativity: ICCC’20.

Piotr Mirowski, Kory Mathewson, Boyd Branch, Thomas Winters, Ben Verhoeven, and Jenny Elfving. 2020. Rosetta code: Improv in any language. In Proceedings of the 11th International Conference on Computational Creativity, pages 115–122. Association for Computational Creativity.

Janice M. Morse. 1994. Designing funded qualitative research. Sage Publications, Inc.

Nikola I. Nikolov, Eric Malmi, Curtis Northcutt, and Loreto Parisi. 2020. Rapformer: Conditional rap lyrics generation with denoising autoencoders. In Proceedings of the 13th International Conference on Natural Language Generation, pages 360–373, Dublin, Ireland. Association for Computational Linguistics.

Jason Obeid and Enamul Hoque. 2020. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. In Proceedings of the 13th International Conference on Natural Language Generation, pages 138–147, Dublin, Ireland. Association for Computational Linguistics.

Hugo Gonçalo Oliveira. 2020. Weirdanalogymatic: Experimenting with analogy for lyrics transformation. In Eleventh International Conference on Computational Creativity: ICCC’20.

Alison Pease and Simon Colton. 2011. On impact and evaluation in computational creativity: A discussion of the Turing test and an alternative proposal. In Proceedings of the AISB Symposium on AI and Philosophy, volume 39.

Andreea-Oana Petac, Anne-Gwenn Bosser, Fred Charles, Pierre De Loor, and Marc Cavazza. 2020. A pragmatics-based model for narrative dialogue generation. In Eleventh International Conference on Computational Creativity: ICCC’20.

Stanley Presser, Mick P. Couper, Judith T. Lessler, Elizabeth Martin, Jean Martin, Jennifer M. Rothgeb, and Eleanor Singer. 2004. Methods for testing and evaluating survey questions. Public Opinion Quarterly, 68(1):109–130.

Matthew Purver, Pablo Gervás, and Sascha Griffiths, editors. 2016. Proceedings of the INLG 2016 Workshop on Computational Creativity in Natural Language Generation. Association for Computational Linguistics, Edinburgh, UK.

Richard Savery, Lisa Zahray, and Gil Weinberg. 2020. Shimon the rapper: A real-time system for human-robot interactive rap battles. In Eleventh International Conference on Computational Creativity: ICCC’20.

Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. 2020. What BERT sees: Cross-modal transfer for visual question generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 327–337, Dublin, Ireland. Association for Computational Linguistics.

Juliana Shihadeh and Margareta Ackerman. 2020. Emily: An Emily Dickinson machine. In Eleventh International Conference on Computational Creativity: ICCC’20.
Brad Spendlove and Dan Ventura. 2020. Creating six-word stories via inferred linguistic and semantic formats. In Proceedings of the 11th International Conference on Computational Creativity.

Susan J. Thomas. 2004. Pilot testing the questionnaire. Using Web and Paper Questionnaires for Data-Based Decision Making.

Jukka Toivanen, Hannu Toivonen, Alessandro Valitutti, and Oskar Gross. 2012. Corpus-based generation of content and form in poetry. In Proceedings of the Third International Conference on Computational Creativity. University College Dublin.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123.

Bradley Tyler, Katherine Wilsdon, and Paul Bodily. 2020. Computational humor: Automated pun generation. In Eleventh International Conference on Computational Creativity: ICCC’20.

Tony Veale. 2016. The shape of tweets to come: Automating language play in social networks, pages 73–92. De Gruyter Mouton.

Qingyun Wang, Qi Zeng, Lifu Huang, Kevin Knight, Heng Ji, and Nazneen Fatema Rajani. 2020. ReviewRobot: Explainable paper review generation based on knowledge synthesis. In Proceedings of the 13th International Conference on Natural Language Generation, pages 384–397, Dublin, Ireland. Association for Computational Linguistics.

G. Wright and Matthew Purver. 2020. Creative language generation in a society of engagement and reflection. In Proceedings of the Eleventh International Conference on Computational Creativity. Association for Computational Creativity (ACC).