Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers

Mika Hämäläinen, Faculty of Arts, University of Helsinki, mika.hamalainen@helsinki.fi
Khalid Alnajjar, Faculty of Arts, University of Helsinki, khalid.alnajjar@helsinki.fi

Abstract

We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.

1 Introduction

Human evaluation in natural language generation (NLG) has become a hot topic lately, with the emergence of several survey papers that study how human evaluation has been conducted in the past in the field of NLG in general (Howcroft et al., 2020; Belz et al., 2020). This has led to several recent evaluation frameworks for evaluating the output of NLG systems (Liu et al., 2020; Gehrmann et al., 2021).

However, not all natural language generation tasks are designed to convey factual information. Some NLG tasks deal with producing text of an aesthetic nature such as poetry, stories, humor and so on. We call these creative NLG tasks. These types of tasks are simultaneously researched in two distinct fields of science: natural language processing (NLP) and computational creativity (CC). Existing survey papers have only focused on NLP research, and they have not made a distinction between creative and non-creative NLG.

The NLP and CC fields conduct work from very different starting points (Purver et al., 2016). NLP is often state-of-the-art driven, whereas CC presents more exploratory research without pursuing scores that outperform a baseline. In this paper, we want to study how human evaluation of creative NLG systems is conducted in the world of NLP and in the world of CC, what similarities there are and whether the two fields can learn something from each other.

We base our research on a literature review of the papers dealing with human-evaluated creative NLG published in the 2020 editions of the International Conference on Computational Creativity (ICCC) and of the International Conference on Natural Language Generation (INLG). We picked these conferences as ICCC is the most important venue for CC research, and INLG the most important NLP-focused venue for NLG research.

Our results show that there is no consensus at the moment on how evaluation should be conducted, despite the many different efforts to establish guidelines for evaluating computationally creative output (Pease and Colton, 2011; Jordanous, 2012; Lamb et al., 2018; Hämäläinen, 2020). We reflect on the results of our survey and propose a road-map for more sound future evaluation practices.

2 Surveying human evaluation methods

In this section, we go through how human evaluation was conducted in the papers we selected for the survey. From the ICCC proceedings, we included all papers that dealt with NLG and had a human evaluation. We did not survey papers that presented work on generating something other than language, such as music. From the INLG proceedings, we picked all papers that presented work on an open-ended NLG problem the output of which could exhibit some creativity, ruling out papers that dealt with purely factual data-to-text generation tasks.
Paper | NLG task | Evaluated parameters | Questions motivated | Evaluation type
1. Mathewson et al. 2020 | Collaborative dialogue | engagement | Engagement measured the notions of revealing and concealing | Ranking models
2. Cheatley et al. 2020 | Song writing tool | Support of self-expression, therapeutic value and receptiveness to the tool and songs created | Not discussed | User study (qualitative)
3. Mirowski et al. 2020 | Auxiliary tool for improv theater | Based on critics' previews and reviews | No questions | Public performance
4. Spendlove and Ventura 2020 | Generating six word stories | coherence, impactfulness | Not discussed | 5 point scale
5. Ammanabrolu et al. 2020 | Quest generation in text adventure games | coherence, originality (novelty), sense of accomplishment (value), unpredictability (surprise) | By Boden's theory on creativity | 7 point scale
6. Mendes and Oliveira 2020b | Headline-proverb pair generation | relatedness, funniness | Not discussed | 4 point scale
7. Tyler et al. 2020 | Pun generation | funniness, surprise, cleverness, did the user laugh, wit, ingenuity, timelessness, and accessibility | Not discussed | 5 point scale
8. Mendes and Oliveira 2020c | Contextual headline adaptation | syntax, relatedness, funniness | Not discussed | 3 point scale
9. Hämäläinen et al. 2020 evaluation 1 | Dialectal adaptation of generated poetry | poem (yes/no), typicality, understandability, quality of language, evoked imagery, evoked emotions, annotator's liking | Previous research | 5 point scale
9. Hämäläinen et al. 2020 evaluation 2 | Dialectal adaptation of generated poetry | emotivity, originality, creativity, poem-likeness, artificiality, fluency | Not discussed | Association
10. Savery et al. 2020 | Real time human-machine rap battles | annotator's perception, coherence, rhythm, rhyme, quality, enjoyment, relation between the hip hop and metal dataset, preference and relationship between input and output | By research questions | Open ended questions + automatic analysis
11. Oliveira 2020 | Song lyric transformation | familiarity, novelty, grammaticality, semantics, singability, overall appreciation and topicality | Not discussed | 5 point scale and picking the most suitable topic
12. Shihadeh and Ackerman 2020 | Emily Dickinson style poem generation | typicality, understandability, quality of language, evoked imagery, evoked emotions, annotator's liking | Previous research | 5 point scale
13. Gong et al. 2020 | Text style transfer | content preservation, transfer strength and fluency | Automated evaluation | Picking the best
14. Obeid and Hoque 2020 | Text generation from charts | informativeness, conciseness, coherence, fluency, factuality | Not discussed | 5 point scale and yes/no/partially/can't decide for factuality
15. Lee 2020 | Style transfer | content, fluency, and style | Not discussed | 5 point scale
16. Mendes and Oliveira 2020a | Enhancing headlines with creative expressions | relatedness, funniness | Not discussed | 4 point scale
17. Langner 2020 | Referring expression generation in a virtual environment | comprehension based on identification time, error rate and repetition counts | Not discussed | User study based on quantitative values
18. Scialom et al. 2020 | Question generation from images | readability, caption relevance and image relevance | Not discussed | 5 point scale
19. Ilinykh and Dobnik 2020 | Multi-sentence image description generation | word choice, object salience, sentence structure and paragraph coherence | Not discussed | Slider
20. Akermi et al. 2020 | Question answering | relevance, errors | Not discussed | Relevance (correct/not correct), error type checkboxes, open ended comment field
21. Nikolov et al. 2020 evaluation 1 | Rap lyric generation | style, meaning, familiarity | Not discussed | 5 point scale
21. Nikolov et al. 2020 evaluation 2 | Rap lyric generation | Turing test | Not discussed | Picking which out of 2 is written by a human
21. Nikolov et al. 2020 evaluation 3 | Rap lyric generation | Turing test | Not discussed | Human written (yes/no)
22. Wang et al. 2020 | Paper review generation | constructiveness and validity | Not discussed | Not stated
23. Hedayatnia et al. 2020 | Response generation in a dialog system | appropriateness | Previous research | Picking the best

Table 1: Evaluated parameters, their motivation and evaluation type in the surveyed papers.

In the ICCC 2020, there were 12 papers that presented human-evaluated work on creative NLG, and in the INLG 2020, there were 11 such papers. We selected these papers for our survey. Conveniently, both venues had roughly the same number of papers.

When surveying the papers, we only focused on human evaluation, and we wanted to know what the NLG task was, what parameters were being evaluated (usually reflected by the evaluation questions), how these parameters (questions) were motivated and how the actual evaluation was conducted methodologically. We also paid attention to the evaluation setup: the number of evaluators and samples used and whether the evaluators were experts or laymen. Finally, we looked into the discussions and conclusions presented in the papers to see what role the human evaluation had there, especially in relation to concrete future directions in improving the system based on the evaluation results.

2.1 What is evaluated?

Table 1 shows the results of our survey in terms of what parameters were evaluated and how the evaluation was conducted. Papers 1-12 were published in ICCC and represent the CC field, whereas papers 13-23 were published in INLG, representing the NLP side of the same coin.

When looking at the results, we can immediately see that there is quite a range of different NLG tasks.
Even for papers that deal with very similar tasks, such as papers 2, 10, 11 and 21, the framing of the problem is very different, ranging from lyric transformation to full-blown human versus computer rap battles. The evaluated parameters were also very different.

Despite the parameters being very different from each other, several papers evaluated meaning in one way or another: for example, papers 4, 5, 10, 14 and 19 evaluated coherence, paper 11 semantics and paper 21 meaning. Papers 9 and 12 evaluated understandability, which is not directly the same as meaning.

Syntactic correctness of the language was also one of the commonly evaluated features. Papers 9, 13 and 14 measure fluency, paper 11 grammaticality, papers 9 and 12 quality of language and paper 8 syntax. In addition, paper 18 evaluated readability, which is partially related to correctness and partially to meaning.

One of the parameters that was evaluated through multiple synonyms and even antonyms was novelty. Papers 5 and 9 evaluated originality, paper 11 novelty, paper 7 surprise and paper 5 unpredictability. Papers 9 and 12 evaluated the opposite of novelty, which is typicality.

Relevance was also commonly evaluated, in papers 18 and 20. The parameter was evaluated as relatedness in papers 6, 8 and 16, although all of them are by the same authors.

Many papers also evaluated emotional value: paper 9 through emotivity, paper 10 through enjoyment, paper 11 through engagement, papers 9 and 12 through evoked emotions and papers 7, 6, 8 and 16 through funniness, although three of these papers were by the same authors.

2.2 Why are the evaluation parameters chosen?

The aforementioned parameters do not cover all the parameters that were used in evaluation; however, they were the most typical ones. When we look into how the evaluation parameters were selected, we notice that most of the papers do not present any reasoning as to why these are the relevant attributes to look at.

The few papers that did present a reasoning had many different reasons for the evaluated parameters. Paper 1 motivates the evaluated parameter by stating that it evaluates the revealing and concealing parameters that were defined as important for the task. Paper 3 did not have any parameters at all for evaluation. Paper 5 motivated the evaluated parameters through an existing theory on computational creativity (Boden, 2007). Paper 10 formulated the evaluated parameters based on the research questions established in the paper. Paper 13 formulated the evaluated parameters so that they would measure the same things as their automated evaluation.

Papers 9 and 12 used evaluation questions originally established by Toivanen et al. (2012). While basing evaluation on existing research makes the evaluation questions sound better motivated, the original paper where these evaluation questions were first established did not present any reasoning as to why these should be the evaluation questions to be used with generated poetry. Paper 23 also stated that they used "a similar setup" as proposed by Li et al. (2016). In practice this meant that whereas the original paper proposed 3 different evaluation setups, paper 23 only presented one of them. The reasoning for this evaluation was not discussed in the original paper either.

2.3 How is the evaluation conducted?

Most of the papers present only one human evaluation method. The exceptions are paper 9, which presents two distinct evaluation setups, and paper 21, which presents 3 distinct evaluation setups.

The most common way of conducting a human evaluation is to use a questionnaire that is rated on a numerical scale. Papers 4, 7, 9 (evaluation 1), 11, 12, 14, 15, 18 and 21 (evaluation 1) used a 5 point scale. Papers 6 and 16, written by the same authors, use a 4 point scale, and paper 8, also by the same authors, uses a 3 point scale. Paper 5 uses a scale as large as 7 points. The most deviant of the papers using a numeric scale is paper 19, which presents a continuous slider the annotators can move freely. Some of the papers use a different scale for one of the questions.

The second most typical evaluation method is based on preference, where the outputs are preferred or ranked in relation to each other. Paper 1 presents a ranking method where different models are ranked based on which one is the best. Paper 9 (evaluation 2) presents two poems side by side and asks annotators to associate the presented parameters with either one of them. Paper 10 uses preference of output as one of the evaluation criteria. Papers 13 and 23 ask the annotators to pick the best output candidate.
Paper 21 (evaluation 2) asks the annotators to guess which output is human written and which is AI written. Paper 11 asks the annotators to rank the most suitable topic; this is slightly different, as here the annotators are not asked to rank the output per se. As we can see, there is a great number of different variations in how this type of evaluation is conducted. As opposed to the most popular evaluation method, these methods only give relative results. This means that even if all of the output was bad, one of the outputs is still picked as the best.

Two papers, 2 and 17, present a user study. Paper 2 conducts this in a qualitative way with open-ended questions where the discussion is directed towards the parameters that the authors wanted to measure. The discussions with the participants are not fully reported in the paper; instead, the authors present some quotes relating to the parameters under study in a non-rigorous fashion. Paper 17 presents a quantitative user study where the results are analyzed based on different values, such as execution time, that were gathered during the user study.

Paper 3 presents something completely unique in terms of evaluation. The authors organize live improv theater sessions with the system and base the results on the reviews and previews by critics. However, these were not discussed in the paper in detail; rather, some cherry-picked quotations were reported.

Paper 10 was another paper to conduct a qualitative evaluation. The annotators were asked to answer open-ended questions, and the input from the annotators was then automatically processed to reach conclusions. An open-ended comment field was also provided in paper 20; however, the paper focused on discussing the results of the two other questions in the questionnaire. The annotators were asked to give a binary rating on whether the output was relevant or not. Similarly, paper 9 (evaluation 1) presented one binary question about poeticity, and paper 21 (evaluation 3) presented a binary question on whether the output was human authored. In addition, paper 20 asked the annotators to indicate which types of errors the output had by providing a set of check-boxes with predefined error types.

Unlike the rest of the papers, paper 22 did not explain how the evaluation was conducted in any detail. The results were percentages, which indicates that the evaluation might have been based on binary questions.

2.4 Sample sizes and annotators

Table 2 shows the number of annotators and sample sizes used in the different papers. We have tried to do our best in collecting this information from the papers; however, these parameters were not always expressed clearly. The worst example is paper 3, which stated that they got multiple reviews, previews and feedback from the audience and the actors without specifying the exact number.

Most of the papers relied on non-expert annotators for conducting the evaluation, with the exception of papers 1, 21 and 22, and partially paper 3. The use of experts is understandable, as not just about anyone is competent enough to tell whether, for example, generated reviews for scientific papers (as in paper 22) are good or bad. However, this leads to a small number of evaluators, as experts are difficult to recruit. Papers that did not use experts to evaluate the output either did not report any special requirements or mostly ensured that the evaluators were proficient enough in the language of the output.

In terms of the sample size, that is, how many generated artefacts were evaluated, the amount varies a lot, from as few as 2 in paper 5 up to 250 in papers 15 and 19. The samples were mostly picked at random; however, some papers, like paper 7, evaluated manually picked output.

There was also a lot of divergence in the number of annotators. Some papers had all annotators go through all samples, as papers 21 and 22 did, while some other papers had several annotators annotate the outputs so that each individual output was evaluated by at least 3 annotators, as in papers 14 and 23. Usually there wasn't any clear discussion on how many outputs a given annotator annotated, with the exception of paper 19, which reported that a given annotator could only annotate up to 30 outputs.

2.5 Evaluation results

An interesting point we wanted to pay attention to was the use of the evaluation results. After conducting a costly and time-consuming human evaluation, one would hope that the results give a direction for future research. However, this was not the case. All papers were limited to writing out the evaluation results and stating which system was better if they evaluated multiple systems. None of the papers was able to identify any concrete future directions for improving the generative system based on the human evaluation results. Human evaluation was merely there to provide some convincing evidence of the quality of the systems.
The only exception to this was paper 9. The authors conducted two different evaluations and reached an insightful conclusion. The two evaluation methods contradicted each other; according to the first evaluation, standard Finnish was preferred over the dialectal variant on all the parameters. However, the second evaluation showed that a dialectal poem was more often associated with originality, creativity and poem-likeness than its standard Finnish variant. The authors note that the results are not only dependent on how you conduct your human evaluation, but also on familiarity bias. In the first evaluation, where dialect was a controlled variable, the further the dialect was from standard Finnish, the lower it scored, as the annotators were less familiar with it.

Paper | Experts | Number of annotators | Number of samples
1. Mathewson et al. 2020 | yes | 4 | 3 conversations (5 utterance-response pairs in each)
2. Cheatley et al. 2020 | no | 3 | Free engagement with the system
3. Mirowski et al. 2020 | yes (reviews), no (audience) | multiple | Performance
4. Spendlove and Ventura 2020 | no | 14 per story | 15 stories
5. Ammanabrolu et al. 2020 | no | 15 for each game | 2 room layouts
6. Mendes and Oliveira 2020b | no | 4 per headline | 60 headlines
7. Tyler et al. 2020 | no | 10 in total | 10 best manually selected puns
8. Mendes and Oliveira 2020c | no | 2 in total | 30 headlines
9. Hämäläinen et al. 2020 evaluation 1 | no | 5 per poem variant | 10 poems
9. Hämäläinen et al. 2020 evaluation 2 | no | 5 per dialectal-standard Finnish poem pair | 10 parallel poems
10. Savery et al. 2020 | no | 33 | 1 video clip, hand picked best output, 10 additional video clips and 10 generated tasks
11. Oliveira 2020 | no | 3 per lyric | 120 lyrics
12. Shihadeh and Ackerman 2020 | no | 17 in total | 10 generated + 2 Emily Dickinson's poems
13. Gong et al. 2020 | no | 2 in total | outputs for 100 inputs
14. Obeid and Hoque 2020 | no | 3 per statistic | output for 40 charts
15. Lee 2020 | no | 6 people per sample | 250 samples
16. Mendes and Oliveira 2020a | no | 4 per headline | 60 headlines
17. Langner 2020 | no | 34 participants | 10 fixed sessions
18. Scialom et al. 2020 | no | 3 in total | 50 images
19. Ilinykh and Dobnik 2020 | no | 154 in total (a participant could rate at most 30 images) | 250 images
20. Akermi et al. 2020 | no | 20 in total | 150 questions
21. Nikolov et al. 2020 evaluation 1 | yes | 3 in total | 100 verses
21. Nikolov et al. 2020 evaluation 2 | yes | 3 in total | 100 verses
21. Nikolov et al. 2020 evaluation 3 | yes | 3 in total | 100 verses
22. Wang et al. 2020 | yes | 2 in total | 50 papers
23. Hedayatnia et al. 2020 | no | 3 per snippet | 200 snippets of 5 turn dialog

Table 2: Evaluators and samples in the surveyed papers.

3 Discussion

There are currently many different creative NLG tasks people work with, and it is understandable that each task calls for slightly different evaluation methods. However, even work on closely related topics prefers to use its own evaluation methods that are not based on any existing research. Most alarmingly, even when the evaluation is based on existing research, the evaluation questions are not motivated in the earlier research either. This type of evaluation has come to be known as a symptom of the Great Misalignment Problem (Hämäläinen and Alnajjar, 2021). When the evaluation is not targeted towards evaluating exactly what has been modelled, any type of evaluation that seems remotely related to the task becomes seemingly valid.

However, when the evaluated parameters have only little to do with what was modelled, it is no surprise that none of the surveyed papers was able to clearly identify the shortcomings of their systems in such a way that they could propose some clear paths to follow for any future research.
In fact, if you evaluate your system based on relatedness and funniness while neither is explicitly modelled, how can you know how to make your system funnier or produce more related output? The scores might well have been achieved by mere serendipity (the annotators happened to like the humor that happened to be in the small sample) (cf. Gervás 2017) or by the data the model was trained on.

Apart from the evaluation questions not aligning with the model, a much larger problem related to evaluation questions can be identified. Firstly, most of the papers were not clear about the actual evaluation questions used; instead, they listed the evaluated parameters as though human evaluation were like an automated one, where one can just score abstract notions such as typicality or fluency accurately on a 5 point scale. In other fields, it is known that even small changes in survey questions can lead to different survey results (Kalton and Schuman, 1982; de Bruin et al., 2011, 2012). Not revealing the actual questions only makes the situation worse. Another problem that arises from abstract evaluation questions is that it becomes less clear why the annotators gave certain answers.

Furthermore, people have a tendency to read more into computer-generated output than what the intention of the system was (Veale, 2016). If you train a generative neural model on jokes, it will surely learn to output jokes, while it does not necessarily have any internal representation of humor. In such a case, the humor is purely in the eyes of the beholder and in the data the model was trained on, not in the method itself (see Colton (2008) for discussion on the roles of the programmer, program and perceiver in creative systems). For instance, Alnajjar et al. (2019) have shown that generated headlines were perceived as more offensive by human annotators, while offensiveness was never modelled in the system.

While mostly every paper we surveyed opts for coming up with its own evaluation metrics, it is astonishing that these newly created evaluation settings are used as such. Other fields dealing with human surveys emphasize the need for conducting tests on your survey before conducting it on a larger scale, to discover potential issues in your questionnaire (Collins, 2003; Presser et al., 2004; Thomas, 2004). None of the papers we surveyed discusses evaluation of evaluation. Instead, it is believed that any new evaluation metric the authors came up with just for a given paper will magically work as such and will yield scientifically valid results that will pass a peer review. All this while many of the papers ask questions using ambiguous terms such as fluency (is something grammatical fluent? is something that seems to make sense semantically fluent? is something that is close to the annotator's own idiolect more fluent than something further away from it? is text generated in American English more fluent to Americans than text generated in British English? and so on) and coherence (is something that repeats the same words coherent? can a complex figure of language be coherent if the annotator does not have time to think about it for more than a couple of seconds? does coherence have something to do with grammaticality as well? is a story that follows the same beliefs as the annotator seen as more coherent? and so on) that are reduced into a compact 1-5 scale that is later neatly averaged over all the annotators' opinions on all the samples. What does an average of 3.5 on a question all annotators might have interpreted differently even mean?
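As a minimal illustration of why a bare mean hides so much (the numbers below are invented for this sketch, not taken from any surveyed paper), the two rating sets share the same mean of 3.5 but describe very different annotator behaviour once the spread and the full distribution are reported:

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical 5-point ratings: identical mean, very different stories.
consensus = [3, 4, 3, 4, 3, 4, 3, 4]   # everyone mildly positive
polarized = [1, 5, 1, 5, 2, 5, 5, 4]   # annotators strongly disagree

for name, ratings in [("consensus", consensus), ("polarized", polarized)]:
    distribution = dict(sorted(Counter(ratings).items()))
    print(f"{name}: mean={mean(ratings):.2f} "
          f"sd={stdev(ratings):.2f} distribution={distribution}")
```

Reporting the distribution, and ideally an inter-annotator agreement coefficient, alongside the mean at least reveals whether a score reflects consensus or disagreement among the annotators.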
In other fields conducting online surveys, there are a lot of worries about selection bias of the human subjects (Bethlehem, 2010; Greenacre, 2016). This is hardly discussed in the fields of NLP and CC. Many of the papers we surveyed conducted their evaluation on a crowd-sourcing platform such as Amazon Mechanical Turk. None of the papers presented statistics on the demographics of the annotators. This might be a source of bias in the results. What makes such a bias even more problematic is the relatively small number of annotators that are usually recruited per individual output. Fields with more established human survey practices would not consider the typical 3-5 annotators of NLP and CC enough even for a qualitative survey, which requires 5-25 participants (Creswell, 1998) or at least 6 participants (Morse, 1994). However, human evaluation is usually conducted quantitatively, which means that the number of annotators depends heavily on multiple parameters and requires planning and justification in its own right (Bell, 1991; Lenth, 2001; Lavrakas, 2008).

It is also very well known that people do not perceive things in a vacuum but rather as a continuum of stimuli where previously perceived stimuli affect the next one. This effect is called priming (see Henson 2009). To reduce the effect of priming, or to keep it consistent, one should either shuffle the order in which the output is presented to the annotators or keep it always the same. Priming is especially in play in cases where annotators are to evaluate outputs produced by different systems. In such a case, the output of a mediocre system might get greatly boosted when presented together with output by a bad system. Nearly none of the papers we surveyed discussed this aspect of their evaluation setting.
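A sketch of one simple way to implement the shuffling option, not taken from any of the surveyed papers and with hypothetical function and item names, is to give every annotator their own deterministic permutation of the items, with outputs of different systems mixed rather than shown in blocks:

```python
import random

def presentation_order(items, annotator_id, seed="eval-2020"):
    """Return a per-annotator shuffled copy of the evaluation items.

    Seeding with a string that includes the annotator id keeps the
    order reproducible while still differing between annotators.
    """
    rng = random.Random(f"{seed}/{annotator_id}")
    order = list(items)
    rng.shuffle(order)
    return order

# Hypothetical (system, output) pairs; mixing systems avoids block effects.
items = [("system_a", "poem 1"), ("system_b", "poem 2"),
         ("system_a", "poem 3"), ("system_b", "poem 4")]
for annotator in ["ann-01", "ann-02"]:
    print(annotator, [text for _, text in presentation_order(items, annotator)])
```

Whichever choice is made, per-annotator shuffling or one fixed order for everyone, it should then be reported, as argued in Section 4.5.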
Both CC and NLP still have a long way to go in order to reach more sound human evaluation practices. However, INLG is a step closer to scientific rigor, as automated evaluation metrics were commonly used together with human evaluation, and sometimes as the only evaluation metric (such as Bień et al. 2020), whereas ICCC had several papers presenting work on creative NLG without any evaluation at all (such as Agafonova et al. 2020; Petac et al. 2020; Wright and Purver 2020).

The use of experts in evaluation is something that should be taken under rigorous inspection in the future. Currently, there are contradicting studies on the topic, indicating that consulting experts does have an effect in machine translation (Toral et al., 2018) but not in poem generation (Lamb et al., 2017). However, this is a question that is very likely to depend on the output that is to be evaluated and also on how the evaluation is conducted.

Human-computer interaction research has more established methodologies for conducting human studies (see Jacko 2012; Lazar et al. 2017), such as cognitive walk-throughs (see Mahatody et al. 2010), human performance evaluation support systems (Ha et al., 2007) and user studies (see MacKenzie 2015). These established methodologies could be taken into account when conducting evaluation of an NLG system that calls for user interaction.

4 Advice for future evaluation

In this section, we outline how human evaluation of creative NLG systems should be conducted. We are not going to give an exact silver-bullet framework to solve the problem, as the two fields are not yet at a state where enough is known about human evaluation to state exactly how the evaluation needs to be conducted. Furthermore, we do not believe that a single fixed framework is enough to capture everything necessary in a topic as broad as creative text generation.
4.1 Define the goals

From very early on, it is important to define what the goals of your system are (see Alnajjar and Hämäläinen 2018; Jordanous 2012). Try to be as concrete and precise as possible at this step. Once you have your goals clearly stated, it is easy to see the degree to which your implemented solution tries to achieve those goals and how much can be attributed to the method and how much to the training data. After this, the evaluation parameters will follow naturally from the goals you set for your system. This way, the evaluation questions do not appear seemingly from nowhere but are motivated by your research goals and implementation.

4.2 Go concrete

People have an inbuilt need to understand anything expressed in their language (see Veale 2016). This can easily lead to a situation where annotators read more into the evaluated output than what your system was aware of. By using evaluation questions that are as concrete as possible, you can reduce the room for subjective interpretation (see Hämäläinen and Alnajjar 2019). For example, for a pun like Becoming a vegetarian is a big missed steak, asking the annotators Is this humorous? and Is this humorous because the pun "missed steak" sounds like "mistake"? will result in different possible interpretations, as the former question might let the annotators consider the generated joke funny for reasons other than those intended by the generative system.

4.3 Run some tests

As we have seen in this paper, the same concept can be evaluated through multiple different wordings, and it is not always clear that the annotators understand the questions in the same way as the researchers intended. By running tests on your survey in real life, you can get more direct feedback than what you could get from annotators on Amazon Mechanical Turk. It is better to adjust your evaluation questions sooner than after running a costly crowd-sourcing campaign.

Furthermore, the final number of annotators you need and how many samples you should evaluate depend on the evaluation task and setting. If you get high diversity in answers in the test run, you will probably need a larger number of annotators conducting the actual evaluation.
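One rough way to turn the diversity observed in a pilot run into a concrete number of ratings per item is the textbook sample size formula for estimating a mean within a chosen margin of error. This is only a crude starting point rather than a substitute for proper power analysis (see Lenth 2001), and the margin, confidence level and pilot ratings below are invented assumptions for illustration:

```python
import math
from statistics import stdev

def ratings_needed(pilot_ratings, margin=0.5, z=1.96):
    """Approximate ratings per item to estimate the mean rating within
    +/- margin at ~95% confidence, using the pilot standard deviation."""
    s = stdev(pilot_ratings)
    return math.ceil((z * s / margin) ** 2)

low_spread = [4, 4, 3, 4, 5, 4]    # pilot annotators largely agree
high_spread = [1, 5, 2, 5, 3, 1]   # pilot annotators disagree a lot
print(ratings_needed(low_spread), ratings_needed(high_spread))  # 7 vs. 52
```

The point of the sketch is simply that a noisier pilot implies many more judgments per item before the reported averages mean much.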
Testing is also a great way of seeing whether you are asking non-experts to evaluate things they consider too difficult or whether your questionnaire is too lengthy. You do not want your annotators to lose interest in the middle of the questionnaire and start annotating fast without paying much attention.

4.4 Run multiple evaluations

Human evaluation does not need to be a one-time thing conducted in a massive survey. You can run multiple different evaluations, such as preference-based ones, 5 point scale ones and true-and-false statements, to better understand the limitations of your system and of your human evaluation. The more evidence gathered by different evaluation methods you can show, the more confident you and other researchers can be of the quality of your method.

4.5 Report everything clearly

It is important to report the evaluation questions exactly as they were used, how the survey form was constructed, including any instructions and the wording used for the 5 point scale, and how the output was presented (always in the same order or shuffled). All of these have an effect on the results. In software engineering, it is considered important to report any threats to the validity of the research (Feldt and Magazinius, 2010). The same should apply to NLP and CC. One of the important threats to the validity of human evaluation is bias in the results. Therefore, it is important to report and discuss what kind of people participated in the evaluation survey.

4.6 Analyze your results

It is also important to dig deeper into the human evaluation results. If you as a researcher put a considerable amount of money into getting your human evaluation results, you should probably make the most out of them too. Instead of merely reporting the typical statistics (mean, mode, median, standard deviation), why not look into the best and worst performing output of the system as well and let the human evaluation be a guide in a deeper error analysis? This can open up insightful directions for future research.
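As a small sketch of what such a deeper look could start from (the data format and item names below are hypothetical, not taken from any surveyed system), one can group the collected ratings per output item and pull out the extremes for manual error analysis:

```python
from collections import defaultdict
from statistics import mean

def rating_extremes(ratings, k=3):
    """Group (item_id, rating) pairs per item and return the k lowest-
    and k highest-scoring items, each with its mean rating."""
    per_item = defaultdict(list)
    for item_id, rating in ratings:
        per_item[item_id].append(rating)
    ranked = sorted((mean(scores), item) for item, scores in per_item.items())
    return ranked[:k], ranked[-k:]

# Hypothetical collected ratings: (generated item, 5-point score).
ratings = [("poem-01", 2), ("poem-01", 1), ("poem-02", 5),
           ("poem-02", 4), ("poem-03", 3), ("poem-03", 4)]
worst, best = rating_extremes(ratings, k=1)
print("worst:", worst, "best:", best)  # inspect these outputs by hand
```

Reading the actual worst and best outputs side by side with their ratings is often what turns an evaluation score into a concrete direction for improving the system.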
5 Conclusions

In this paper, we have surveyed papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. There have been many different evaluation methods, including some unconventional ones such as critics' reviews and user testing. The most typical human evaluation method has been a scaled survey, typically on a 5 point scale.

While most of the papers surveyed had come up with their own evaluation metrics, the most common parameters that have been evaluated were meaning, syntactic correctness, novelty, relevance and emotional value, although the terms used to refer to these notions have not been the same.

Most of the papers did not justify why they had evaluated certain parameters. Instead, the parameters were usually just stated as though they were an inarguable fact. It was more often than not the case that the actual evaluation questions were not revealed.

There was a lot of variation in the number of samples taken from the system output and in how many annotators were used to conduct the evaluation. Typically the numbers were rather small. There was no discussion about the demographics of the annotators, nor about what type of bias this might have introduced.

Evaluation setups were never tested out beforehand, even though other fields dealing with human surveys recommend testing your questionnaires. This means that it is impossible to tell what the annotators really understood by the evaluation questions.

We established some advice for future evaluation, which includes clearly defining the goal of the generative system, asking questions as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.

All in all, our fields, CC and NLP, have a lot to learn from other fields with longer traditions of human questionnaires in terms of conducting human evaluation. At the current stage, none of the papers we surveyed quite reached the same level of scientific rigor in their human evaluation as is to be expected in other fields of science. However, this is not to say that the work of the authors of the papers we surveyed is inherently bad. This is just to highlight the fact that more attention needs to be paid to how human evaluation is conducted. Quite often with creative text generation, human judgment is the only viable metric to measure the performance of a system.
Human evaluation of generated text has been conducted in the field of NLP as early as the 1960s (McDaniel et al., 1967); it is a pity it has not caught up with the rest of the development in the field.

References

Yana Agafonova, Alexey Tikhonov, and Ivan P Yamshchikov. 2020. Paranoid transformer: Reading narrative of madness as computational approach to creativity. In Eleventh International Conference on Computational Creativity: ICCC'20. Association for Computational Creativity.

Imen Akermi, Johannes Heinecke, and Frédéric Herledan. 2020. Tansformer based natural language generation for question-answering. In Proceedings of the 13th International Conference on Natural Language Generation, pages 349–359, Dublin, Ireland. Association for Computational Linguistics.

Khalid Alnajjar and Mika Hämäläinen. 2018. A master-apprentice approach to automatic creation of culturally satirical movie titles. In Proceedings of the 11th International Conference on Natural Language Generation, pages 274–283.

Khalid Alnajjar, Leo Leppänen, and Hannu Toivonen. 2019. No time like the present: Methods for generating colourful and factual multilingual news headlines. In Proceedings of the 10th International Conference on Computational Creativity. Association for Computational Creativity.

Prithviraj Ammanabrolu, William Broniec, Alex Mueller, Jeremy Paul, and Mark Riedl. 2020. Toward automated quest generation in text-adventure games. In Proceedings of the 11th International Conference on Computational Creativity.

John F Bell. 1991. Big is not necessarily beautiful in survey design: Measurement error and the APU science survey. Journal of the Royal Statistical Society: Series D (The Statistician), 40(3):291–300.

Anya Belz, Simon Mille, and David M. Howcroft. 2020. Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing. In Proceedings of the 13th International Conference on Natural Language Generation, pages 183–194, Dublin, Ireland. Association for Computational Linguistics.

Jelke Bethlehem. 2010. Selection bias in web surveys. International Statistical Review, 78(2):161–188.

Michał Bień, Michał Gilski, Martyna Maciejewska, Wojciech Taisner, Dawid Wisniewski, and Agnieszka Lawrynowicz. 2020. RecipeNLG: A cooking recipes dataset for semi-structured text generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 22–28, Dublin, Ireland. Association for Computational Linguistics.

Margaret A Boden. 2007. Creativity in a nutshell. Think, 5(15):83–96.

Wändi Bruine de Bruin, Martine Baldassi, Bernd Figner, Baruch Fischhoff, Lauren Fleishman, David Hardisty, Eric Johnson, Gideon Keren, Maria Konnikova, and Irwin Levin. 2011. Framing effects in surveys: How respondents make sense of the questions we ask. In Perspectives on Framing, edited by Gideon Keren, 303–325. New York, NY. Taylor and Francis Group, Psychology Press.

Wändi Bruine de Bruin, Wilbert Van der Klaauw, Giorgio Topa, Julie S Downs, Baruch Fischhoff, and Olivier Armantier. 2012. The effect of question wording on consumers' reported inflation expectations. Journal of Economic Psychology, 33(4):749–757.

Lee Cheatley, Margareta Ackerman, Alison Pease, and Wendy Moncur. 2020. Co-creative songwriting for bereavement support. In Eleventh International Conference on Computational Creativity: ICCC'20, pages 33–41. Association for Computational Creativity.

Debbie Collins. 2003. Pretesting survey instruments: An overview of cognitive methods. Quality of Life Research, 12(3):229–238.

Simon Colton. 2008. Creativity versus the perception of creativity in computational systems. In AAAI Spring Symposium: Creative Intelligent Systems, volume 8.

John W Creswell. 1998. Qualitative inquiry and research design: Choosing among five traditions. Sage Publications.

Robert Feldt and Ana Magazinius. 2010. Validity threats in empirical software engineering research - an initial survey. In SEKE, pages 374–379.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.
Pablo Gervás. 2017. Template-free construction of rhyming poems with thematic cohesion. In Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017), pages 21–28, Santiago de Compostela, Spain. Association for Computational Linguistics.

Hongyu Gong, Linfeng Song, and Suma Bhat. 2020. Rich syntactic and semantic information helps unsupervised text style transfer. In Proceedings of the 13th International Conference on Natural Language Generation, pages 113–119, Dublin, Ireland. Association for Computational Linguistics.

Zerrin Asan Greenacre. 2016. The importance of selection bias in internet surveys. Open Journal of Statistics, 6(03):397.

Jun Su Ha, Poong Hyun Seong, Myeong Soo Lee, and Jin Hyuk Hong. 2007. Development of human performance measures for human factors validation in the advanced MCR of APR-1400. IEEE Transactions on Nuclear Science, 54(6):2687–2700.

Mika Hämäläinen. 2020. Generating Creative Language - Theories, Practice and Evaluation. University of Helsinki.

Mika Hämäläinen and Khalid Alnajjar. 2019. Let's face it. Finnish poetry generation with aesthetics and framing. In Proceedings of the 12th International Conference on Natural Language Generation, pages 290–300.

Mika Hämäläinen and Khalid Alnajjar. 2021. The great misalignment problem in human evaluation of NLP methods. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 69–74, Online. Association for Computational Linguistics.

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar, Jack Rueter, and Thierry Poibeau. 2020. Automatic dialect adaptation in Finnish and its effect on perceived creativity. In 11th International Conference on Computational Creativity (ICCC'20). Association for Computational Creativity.

Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric, and Dilek Hakkani-Tur. 2020. Policy-driven neural response generation for knowledge-grounded dialog systems. In Proceedings of the 13th International Conference on Natural Language Generation, pages 412–421, Dublin, Ireland. Association for Computational Linguistics.

Rik Henson. 2009. Priming. In Encyclopedia of Neuroscience, volume 7. Academic Press.

David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, and Verena Rieser. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 169–182, Dublin, Ireland. Association for Computational Linguistics.

Nikolai Ilinykh and Simon Dobnik. 2020. When an image tells a story: The role of visual and semantic information for generating paragraph descriptions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 338–348, Dublin, Ireland. Association for Computational Linguistics.

Julie A Jacko. 2012. Human computer interaction handbook: Fundamentals, evolving technologies, and emerging applications. CRC Press.

Anna Jordanous. 2012. A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative. Cognitive Computation, 4(3):246–279.

Graham Kalton and Howard Schuman. 1982. The effect of the question on survey responses: A review. Journal of the Royal Statistical Society: Series A (General), 145(1):42–57.

Carolyn Lamb, Daniel G Brown, and Charles LA Clarke. 2017. Incorporating novelty, meaning, reaction and craft into computational poetry: A negative experimental result. In ICCC, pages 183–188.

Carolyn Lamb, Daniel G Brown, and Charles LA Clarke. 2018. Evaluating computational creativity: An interdisciplinary tutorial. ACM Computing Surveys (CSUR), 51(2):1–34.

Maurice Langner. 2020. OMEGA: A probabilistic approach to referring expression generation in a virtual environment. In Proceedings of the 13th International Conference on Natural Language Generation, pages 296–305, Dublin, Ireland. Association for Computational Linguistics.

Paul J Lavrakas. 2008. Sample size. In Encyclopedia of Survey Research Methods. Sage Publications.

Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research methods in human-computer interaction. Morgan Kaufmann.

Joosung Lee. 2020. Stable style transformer: Delete and generate approach with encoder-decoder for text style transfer. In Proceedings of the 13th International Conference on Natural Language Generation, pages 195–204, Dublin, Ireland. Association for Computational Linguistics.
Russell V Lenth. 2001. Some practical guidelines for effective sample size determination. The American Statistician, 55(3):187–193.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, et al. 2020. GLGE: A new general language generation evaluation benchmark. arXiv preprint arXiv:2011.11928.

I Scott MacKenzie. 2015. User studies and usability evaluations: From research to products. In Graphics Interface, pages 1–8.

Thomas Mahatody, Mouldi Sagar, and Christophe Kolski. 2010. State of the art on the cognitive walkthrough method, its variants and evolutions. Intl. Journal of Human–Computer Interaction, 26(8):741–785.

Kory Mathewson, Pablo Samuel Castro, Colin Cherry, George Foster, and Marc G Bellemare. 2020. Shaping the narrative arc: Information-theoretic collaborative dialogue.

J. McDaniel, W.L. Price, A.J.M. Szanser, and D.M. Yates. 1967. An evaluation of the usefulness of machine translations produced at the National Physical Laboratory, Teddington, with a summary of the translation methods. In COLING 1967 Volume 1: Conference Internationale Sur Le Traitement Automatique Des Langues.

Rui Mendes and Hugo Gonçalo Oliveira. 2020a. Amplifying the range of news stories with creativity: Methods and their evaluation, in Portuguese. In Proceedings of the 13th International Conference on Natural Language Generation, pages 252–262, Dublin, Ireland. Association for Computational Linguistics.

Rui Mendes and Hugo Gonçalo Oliveira. 2020b. Comparing different methods for assigning Portuguese proverbs to news headlines. In Eleventh International Conference on Computational Creativity: ICCC'20.

Rui Mendes and Hugo Gonçalo Oliveira. 2020c. Teco: Exploring word embeddings for text adaptation to a given context. In Eleventh International Conference on Computational Creativity: ICCC'20.

Piotr Mirowski, Kory Mathewson, Boyd Branch, Thomas Winters, Ben Verhoeven, and Jenny Elfving. 2020. Rosetta code: Improv in any language. In Proceedings of the 11th International Conference on Computational Creativity, pages 115–122. Association for Computational Creativity.

Janice M Morse. 1994. Designing funded qualitative research. Sage Publications, Inc.

Nikola I. Nikolov, Eric Malmi, Curtis Northcutt, and Loreto Parisi. 2020. Rapformer: Conditional rap lyrics generation with denoising autoencoders. In Proceedings of the 13th International Conference on Natural Language Generation, pages 360–373, Dublin, Ireland. Association for Computational Linguistics.

Jason Obeid and Enamul Hoque. 2020. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. In Proceedings of the 13th International Conference on Natural Language Generation, pages 138–147, Dublin, Ireland. Association for Computational Linguistics.

Hugo Gonçalo Oliveira. 2020. Weirdanalogymatic: Experimenting with analogy for lyrics transformation. In Eleventh International Conference on Computational Creativity: ICCC'20.

Alison Pease and Simon Colton. 2011. On impact and evaluation in computational creativity: A discussion of the Turing test and an alternative proposal. In Proceedings of the AISB Symposium on AI and Philosophy, volume 39.

Andreea-Oana Petac, Anne-Gwenn Bosser, Fred Charles, Pierre De Loor, and Marc Cavazza. 2020. A pragmatics-based model for narrative dialogue generation. In Eleventh International Conference on Computational Creativity: ICCC'20.

Stanley Presser, Mick P Couper, Judith T Lessler, Elizabeth Martin, Jean Martin, Jennifer M Rothgeb, and Eleanor Singer. 2004. Methods for testing and evaluating survey questions. Public Opinion Quarterly, 68(1):109–130.

Matthew Purver, Pablo Gervás, and Sascha Griffiths, editors. 2016. Proceedings of the INLG 2016 Workshop on Computational Creativity in Natural Language Generation. Association for Computational Linguistics, Edinburgh, UK.

Richard Savery, Lisa Zahray, and Gil Weinberg. 2020. Shimon the rapper: A real-time system for human-robot interactive rap battles. In Eleventh International Conference on Computational Creativity: ICCC'20.

Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. 2020. What BERT sees: Cross-modal transfer for visual question generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 327–337, Dublin, Ireland. Association for Computational Linguistics.

Juliana Shihadeh and Margareta Ackerman. 2020. Emily: An Emily Dickinson machine. In Eleventh International Conference on Computational Creativity: ICCC'20.
Brad Spendlove and Dan Ventura. 2020. Creating six-word stories via inferred linguistic and semantic formats. In Proceedings of the 11th International Conference on Computational Creativity.

Susan J Thomas. 2004. Pilot testing the questionnaire. Using web and paper questionnaires for data-based decision making.

Jukka Toivanen, Hannu Toivonen, Alessandro Valitutti, and Oskar Gross. 2012. Corpus-based generation of content and form in poetry. In Proceedings of the Third International Conference on Computational Creativity. University College Dublin.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123.

Bradley Tyler, Katherine Wilsdon, and Paul Bodily. 2020. Computational humor: Automated pun generation. In Eleventh International Conference on Computational Creativity: ICCC'20.

Tony Veale. 2016. The shape of tweets to come: Automating language play in social networks, pages 73–92. De Gruyter Mouton.

Qingyun Wang, Qi Zeng, Lifu Huang, Kevin Knight, Heng Ji, and Nazneen Fatema Rajani. 2020. ReviewRobot: Explainable paper review generation based on knowledge synthesis. In Proceedings of the 13th International Conference on Natural Language Generation, pages 384–397, Dublin, Ireland. Association for Computational Linguistics.

G Wright and Matthew Purver. 2020. Creative language generation in a society of engagement and reflection. In Proceedings of the Eleventh International Conference on Computational Creativity. Association for Computational Creativity (ACC).