Natural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems

Verena Rieser, School of Informatics, University of Edinburgh (vrieser@inf.ed.ac.uk)
Oliver Lemon, School of Informatics, University of Edinburgh (olemon@inf.ed.ac.uk)

Abstract

We present and evaluate a new model for Natural Language Generation (NLG) in Spoken Dialogue Systems, based on statistical planning, given noisy feedback from the current generation context (e.g. a user and a surface realiser). We study its use in a standard NLG problem: how to present information (in this case a set of search results) to users, given the complex trade-offs between utterance length, amount of information conveyed, and cognitive load. We set these trade-offs by analysing existing MATCH data. We then train an NLG policy using Reinforcement Learning (RL), which adapts its behaviour to noisy feedback from the current generation context. This policy is compared to several baselines derived from previous work in this area. The learned policy significantly outperforms all the prior approaches.

1   Introduction

Natural language allows us to achieve the same communicative goal ("what to say") using many different expressions ("how to say it"). In a Spoken Dialogue System (SDS), an abstract communicative goal (CG) can be generated in many different ways. For example, the CG to present database results to the user can be realized as a summary (Polifroni and Walker, 2008; Demberg and Moore, 2006), by comparing items (Walker et al., 2004), or by picking one item and recommending it to the user (Young et al., 2007).

Previous work has shown that it is useful to adapt the generated output to certain features of the dialogue context, for example user preferences, e.g. (Walker et al., 2004; Demberg and Moore, 2006), user knowledge, e.g. (Janarthanam and Lemon, 2008), or predicted TTS quality, e.g. (Nakatsu and White, 2006).

In extending this previous work we treat NLG as a statistical sequential planning problem, analogously to current statistical approaches to Dialogue Management (DM), e.g. (Singh et al., 2002; Henderson et al., 2008; Rieser and Lemon, 2008a), and to "conversation as action under uncertainty" (Paek and Horvitz, 2000). In NLG we face similar trade-offs and unpredictability as in DM, and in some systems the content planning and DM tasks overlap. Clearly, very long system utterances with many actions in them are to be avoided, because users may become confused or impatient, but each individual NLG action will convey some (potentially) useful information to the user. There is therefore an optimization problem to be solved. Moreover, the user judgements or next (most likely) user action after each NLG action are unpredictable, and the behaviour of the surface realizer may also be variable (see Section 6.2).

NLG could therefore fruitfully be approached as a sequential statistical planning task, where there are trade-offs and decisions to make, such as whether to choose another NLG action (and which one to choose) or to instead stop generating. Reinforcement Learning (RL) allows us to optimize such trade-offs in the presence of uncertainty, i.e. to weigh the chance of reaching a better state against the risk of choosing another action.

In this paper we present and evaluate a new model for NLG in Spoken Dialogue Systems as planning under uncertainty. In Section 2 we argue for applying RL to NLG problems and explain the overall framework. In Section 3 we discuss challenges for NLG for Information Presentation. In Section 4 we present results from our analysis of the MATCH corpus (Walker et al., 2004). In Section 5 we present a detailed example of our proposed NLG method. In Section 6 we report on experimental results using this framework for exploring Information Presentation policies. In Section 7 we conclude and discuss future directions.
2   NLG as planning under uncertainty

We adopt the general framework of NLG as planning under uncertainty (see (Lemon, 2008) for the initial version of this approach). Some aspects of NLG have been treated as planning, e.g. (Koller and Stone, 2007; Koller and Petrick, 2008), but never before as statistical planning.

NLG actions take place in a stochastic environment, for example consisting of a user, a realizer, and a TTS system, where the individual NLG actions have uncertain effects on the environment. For example, presenting differing numbers of attributes to the user makes the user more or less likely to choose an item, as shown by (Rieser and Lemon, 2008b) for multimodal interaction.

Most SDS employ fixed template-based generation. Our goal, however, is to employ a stochastic realizer for SDS, see for example (Stent et al., 2004). This will introduce additional noise, which higher-level NLG decisions will need to react to. In our framework, the NLG component must achieve a high-level Communicative Goal from the Dialogue Manager (e.g. to present a number of items) through planning a sequence of lower-level generation steps or actions, for example first summarizing all the items and then recommending the highest-ranking one. Each such action has unpredictable effects due to the stochastic realizer. For example, the realizer might employ 6 attributes when recommending item i4, but it might use only 2 (e.g. price and cuisine for restaurants), depending on its own processing constraints (see e.g. the realizer used to collect the MATCH project data). Likewise, the user may be likely to choose an item after hearing a summary, or they may wish to hear more. Generating appropriate language in context (e.g. given the attributes presented so far) thus has the following important features in general:

  • NLG is goal-driven behaviour
  • NLG must plan a sequence of actions
  • each action changes the environment state or context
  • the effect of each action is uncertain.

These facts make it clear that the problem of planning how to generate an utterance falls naturally into the class of statistical planning problems, rather than rule-based approaches such as (Moore et al., 2004; Walker et al., 2004), or supervised learning as explored in previous work, such as classifier learning and re-ranking, e.g. (Stent et al., 2004; Oh and Rudnicky, 2002). Supervised approaches involve the ranking of a set of completed plans/utterances and as such cannot adapt online to the context or the user. Reinforcement Learning (RL) provides a principled, data-driven optimisation framework for our type of planning problem (Sutton and Barto, 1998).
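To make this framing concrete, the following minimal sketch (not from the paper; the class, action names, and transition behaviour are illustrative assumptions) casts Information Presentation as an MDP-style interface: a generation state recording what has been produced so far, a small set of NLG actions, and a step function whose effects are stochastic, reflecting the variable realizer and the unpredictable user.

```python
import random
from dataclasses import dataclass

ACTIONS = ["SUMMARY", "COMPARE", "RECOMMEND", "stop"]

@dataclass
class GenState:
    attributes: int = 0            # attributes realised so far
    sentences: int = 0             # sentences generated so far
    user_reaction: str = "silent"  # predicted next user act
    done: bool = False

def step(state: GenState, action: str) -> GenState:
    """One generation step: the effect of an NLG action is uncertain,
    because the realizer output and the user's reaction both vary."""
    if action == "stop" or state.done:
        return GenState(state.attributes, state.sentences, state.user_reaction, True)
    # Illustrative stochastic effect: the realizer decides how many
    # attributes/sentences this action actually produces.
    new_attrs = state.attributes + random.randint(1, 4)
    new_sents = state.sentences + random.randint(1, 6)
    # The user may "barge in" (choose an item or quit) at any point.
    reaction = random.choices(["silent", "choose", "addInfo", "quit"],
                              weights=[0.5, 0.2, 0.2, 0.1])[0]
    return GenState(new_attrs, new_sents, reaction, reaction in ("choose", "quit"))

# A generation episode is a sequence of such uncertain steps.
s = GenState()
for a in ["SUMMARY", "RECOMMEND", "stop"]:
    s = step(s, a)
    print(a, s)
```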

3   The Information Presentation Problem

We will tackle the well-studied problem of Information Presentation in NLG to show the benefits of this approach. The task here is to find the best way to present a set of search results to a user (e.g. some restaurants meeting a certain set of constraints). This is a task common to much prior work in NLG, e.g. (Walker et al., 2004; Demberg and Moore, 2006; Polifroni and Walker, 2008).

In this problem, there are many decisions available for exploration. For instance, which presentation strategy to apply (NLG strategy selection), how many attributes of each item to present (attribute selection), how to rank the items and attributes according to different models of user preferences (attribute ordering), how many (specific) items to tell the user about (conciseness), how many sentences to use when doing so (syntactic planning), and which words to use (lexical choice). All these parameters (and potentially many more) can be varied and, ideally, jointly optimised based on user judgements.

We had two corpora available to study some regions of this decision space. We utilise the MATCH corpus (Walker et al., 2004) to extract an evaluation function (also known as a "reward function") for RL. Furthermore, we utilise the SPaRKy corpus (Stent et al., 2004) to build a high-quality stochastic realizer. Both corpora contain data from "overhearer" experiments targeted to Information Presentation in dialogues in the restaurant domain. While we are ultimately interested in how hearers engaged in dialogues judge different Information Presentations, results from overhearers are still directly relevant to the task.

4   MATCH corpus analysis

The MATCH project made two data sets available, see (Stent et al., 2002) and (Whittaker et al., 2003), which we combine to define an evaluation function for different Information Presentation strategies.
strategy          example                                                           av.#attr       av.#sentence
  SUMMARY           “The 4 restaurants differ in food quality, and cost.”             2.07±.63          1.56±.5
                    (#attr = 2, #sentence = 1)
  COMPARE           “Among the selected restaurants, the following offer               3.2±1.5          5.5±3.11
                    exceptional overall value. Aureole’s price is 71 dol-
                    lars. It has superb food quality, superb service and
                    superb decor. Daniel’s price is 82 dollars. It has su-
                    perb food quality, superb service and superb decor.”
                    (#attr = 4, #sentence = 5)
  RECOMMEND         “Le Madeleine has the best overall value among the                  2.4±.7           3.5±.53
                    selected restaurants. Le Madeleine’s price is 40 dol-
                    lars and It has very good food quality. It’s in Mid-
                    town West. ” (#attr = 3, #sentence = 3)

Table 1: NLG strategies present in the MATCH corpus with average no. attributes and sentences as found
in the data.

The first data set, see (Stent et al., 2002), comprises 1024 ratings by 16 subjects (where we only use the speech-based half, n = 512) on the following presentation strategies: RECOMMEND, COMPARE, and SUMMARY. These strategies are realized using templates as in Table 1, with varying numbers of attributes. In this study the users rate the individual presentation strategies as significantly different (F(2) = 1361, p < .001). We find that SUMMARY is rated significantly worse (p = .05 with Bonferroni correction) than RECOMMEND and COMPARE, which are rated as equally good.

This suggests that one should never generate a SUMMARY. However, SUMMARY has different qualities from COMPARE and RECOMMEND, as it gives users a general overview of the domain, and probably helps the user to feel more confident when choosing an item, especially when they are unfamiliar with the domain, as shown by (Polifroni and Walker, 2008).

In order to further describe the strategies, we extracted different surface features as present in the data (e.g. number of attributes realised, number of sentences, number of words, number of database items talked about, etc.) and performed a stepwise linear regression to find the features which were important to the overhearers (following the PARADISE framework (Walker et al., 2000)). We discovered a trade-off between the length of the utterance (#sentence) and the number of attributes realised (#attr), i.e. its informativeness, where overhearers like to hear as many attributes as possible in the most concise way, as indicated by the regression model shown in Equation 1 (R² = .34).¹

    score = .775 × #attr + (−.301) × #sentence    (1)

¹ For comparison: (Walker et al., 2000) report an R² between .4 and .5 on a slightly larger data set.
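As an illustration of how such a regression model can be fitted, the sketch below (not the authors' code; the feature matrix and ratings are invented toy data) uses ordinary least squares over the two features that survived the stepwise selection, #attr and #sentence.

```python
import numpy as np

# Toy stand-in for the MATCH overhearer ratings: one row per rated
# presentation, columns are [#attr, #sentence], y is the user score.
X = np.array([[2, 1], [4, 5], [3, 3], [6, 4], [2, 2], [5, 6]], dtype=float)
y = np.array([1.2, 1.5, 1.4, 2.9, 0.9, 2.0])

# Add an intercept column and solve the least-squares problem
# score ≈ w_attr * #attr + w_sent * #sentence + b.
A = np.hstack([X, np.ones((len(X), 1))])
(w_attr, w_sent, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"score = {w_attr:.3f} * #attr + {w_sent:.3f} * #sentence + {b:.3f}")
# A full stepwise regression would additionally add/remove candidate
# features (words, database items, ...) based on significance tests.
```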
The second MATCH data set, see (Whittaker et al., 2003), comprises 1224 ratings by 17 subjects on the NLG strategies RECOMMEND and COMPARE. The strategies realise varying numbers of attributes according to different "conciseness" values: concise (1 or 2 attributes), average (3 or 4), and verbose (4, 5, or 6). Overhearers rate all conciseness levels as significantly different (F(2) = 198.3, p < .001), with verbose rated highest and concise rated lowest, supporting our findings in the first data set. However, the relation between number of attributes and user ratings is not strictly linear: ratings drop for #attr = 6. This suggests that there is an upper limit on how many attributes users like to hear. We expect this to be especially true for real users engaged in actual dialogue interaction, see (Winterboer et al., 2007). We therefore include "cognitive load" as a variable when training the policy (see Section 6).

In addition to the trade-off between length and informativeness for single NLG strategies, we are interested in whether this trade-off will also hold for generating sequences of NLG actions. (Whittaker et al., 2002), for example, generate a combined strategy where first a SUMMARY is used to describe the retrieved subset and then they RECOMMEND one specific item/restaurant.

Figure 1: Possible NLG policies (X=stop generation)

For example: "The 4 restaurants are all French, but differ in food quality, and cost. Le Madeleine has the best overall value among the selected restaurants. Le Madeleine's price is 40 dollars and It has very good food quality. It's in Midtown West."

We therefore extend the set of possible strategies present in the data for exploration: we allow ordered combinations of the strategies, assuming that only COMPARE or RECOMMEND can follow a SUMMARY, and that only RECOMMEND can follow a COMPARE, resulting in 7 possible actions:

  1. RECOMMEND
  2. COMPARE
  3. SUMMARY
  4. COMPARE+RECOMMEND
  5. SUMMARY+RECOMMEND
  6. SUMMARY+COMPARE
  7. SUMMARY+COMPARE+RECOMMEND

We then analytically solved the regression model in Equation 1 for the 7 possible strategies, using average values from the MATCH data; this amounts to solving a system of linear inequalities. According to this model, the best-ranking strategy is to do all the presentation strategies in one sequence, i.e. SUMMARY+COMPARE+RECOMMEND.
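This ranking can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; it assumes, as a simplification, that a combined strategy's informativeness and length are the sums of its components' averages from Table 1, and scores each of the 7 strategies with Equation 1.

```python
# Average (#attr, #sentence) per single strategy, from Table 1.
AVG = {"SUMMARY": (2.07, 1.56), "COMPARE": (3.2, 5.5), "RECOMMEND": (2.4, 3.5)}

def score(strategy_sequence):
    """Equation 1: score = .775 * #attr - .301 * #sentence,
    summing the averages of the component strategies (an assumption)."""
    attr = sum(AVG[s][0] for s in strategy_sequence)
    sent = sum(AVG[s][1] for s in strategy_sequence)
    return 0.775 * attr - 0.301 * sent

strategies = [
    ["RECOMMEND"], ["COMPARE"], ["SUMMARY"],
    ["COMPARE", "RECOMMEND"], ["SUMMARY", "RECOMMEND"],
    ["SUMMARY", "COMPARE"], ["SUMMARY", "COMPARE", "RECOMMEND"],
]
for s in sorted(strategies, key=score, reverse=True):
    print(f"{'+'.join(s):35s} {score(s):+.2f}")
# Under this simplification, the full sequence SUMMARY+COMPARE+RECOMMEND
# comes out on top, matching the analytic solution discussed above.
```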
However, this analytic solution assumes a "one-shot" generation strategy where there is no intermediate feedback from the environment: users are simply static overhearers (they cannot "barge in", for example), there is no variation in the behaviour of the surface realizer (i.e. one would use fixed templates as in MATCH), and the user has unlimited cognitive capabilities. These assumptions are not realistic and must be relaxed. In the next section we describe a worked-through example of the overall framework.

5   Method: the RL-NLG model

For the reasons discussed above, we treat the NLG module as a statistical planner, operating in a stochastic environment, and optimise it using Reinforcement Learning. The input to the module is a Communicative Goal supplied by the Dialogue Manager. The CG consists of a Dialogue Act to be generated, for example present_items(i1, i2, i5, i8), and a System Goal (SysGoal), which is the desired user reaction, e.g. to make the user choose one of the presented items (user_choose_one_of(i1, i2, i5, i8)). The RL-NLG module must plan a sequence of lower-level NLG actions that achieve the goal (at lowest cost) in the current context. The context consists of a user (who may remain silent, supply more constraints, choose an item, or quit), and variation from the sentence realizer described above.

Now let us walk through one simple utterance plan as carried out by this model, as shown in Table 2. Here, we start with the CG present_items(i1, i2, i5, i8) & user_choose_one_of(i1, i2, i5, i8) from the system's DM. This initialises the NLG state (init). The policy chooses the action SUMMARY and this transitions us to state s1, where we observe that 4 attributes and 1 sentence have been generated, and the user is predicted to remain silent. In this state, the current NLG policy is to RECOMMEND the top-ranked item (i5, for this user), which takes us to state s2, where 8 attributes have been generated in a total of 4 sentences, and the user chooses an item.

Figure 2: Example RL-NLG action sequence for Table 2 (actions: summarise, recommend, stop; environment: atts=4, user=silent after the summary; atts=8, user=choose after the recommendation; then the Reward is computed)

State   Action                                                                State change/effect
init    SysGoal: present_items(i1, i2, i5, i8) & user_choose_one_of(i1, i2, i5, i8)   initialise state
s1      RL-NLG: SUMMARY(i1, i2, i5, i8)                                       att=4, sent=1, user=silent
s2      RL-NLG: RECOMMEND(i5)                                                 att=8, sent=4, user=choose(i5)
end     RL-NLG: stop                                                          calculate Reward

Table 2: Example utterance planning sequence for Figure 2

The policy holds that in states like s2 the best thing to do is "stop" and pass the turn to the user. This takes us to the state end, where the total reward of this action sequence is computed (see Section 6.3) and used to update the NLG policy in each of the visited state-action pairs via back-propagation.

6   Experiments

We now report on a proof-of-concept study where we train our policy in a simulated learning environment based on the results from the MATCH corpus analysis in Section 4. Simulation-based RL allows us to explore unseen actions which are not in the data, and thus less initial data is needed (Rieser and Lemon, 2008b). Note that we cannot directly learn from the MATCH data, as for that we would need data from an interactive dialogue. We are currently collecting such data in a Wizard-of-Oz experiment.

6.1   User simulation

User simulations are commonly used to train strategies for Dialogue Management, see for example (Young et al., 2007). A user simulation for NLG is very similar, in that it is a predictive model of the most likely next user act. However, this user act does not actually change the overall dialogue state (e.g. by filling slots); it only changes the generator state. In other words, the NLG user simulation tells us what the user is most likely to do next if we were to stop generating now. It also tells us the probability that the user chooses to "barge in" after a system NLG action (by either choosing an item or providing more information).

The user simulation for this study is a simple bi-gram model, which relates the number of attributes presented to the next likely user actions, see Table 3. The user can either follow the goal provided by the DM (SysGoal), for example by choosing an item; do something else (userElse), e.g. provide another constraint; or quit (userQuit).

For simplification, we discretise the number of attributes into concise-average-verbose, reflecting the conciseness values from the MATCH data, as described in Section 4. In addition, we assume that the user's cognitive abilities are limited ("cognitive load"), based on the results from the second MATCH data set in Section 4. Once the number of attributes exceeds the "magic number 7" (reflecting psychological results on short-term memory (Baddeley, 2001)), the user is more likely to become confused and quit.

The probabilities in Table 3 are currently manually set heuristics. We are currently conducting a Wizard-of-Oz study in order to learn these probabilities (and other user parameters) from real data.

              SysGoal    userElse    userQuit
  concise     20.0       60.0        20.0
  average     60.0       20.0        20.0
  verbose     20.0       20.0        60.0

        Table 3: NLG bi-gram user simulation
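A minimal sketch of such a bi-gram user simulation is given below (not the authors' implementation; the discretisation thresholds are an assumption based on the conciseness bands described in Section 4): it maps the number of attributes generated so far to a distribution over the three user reactions in Table 3 and samples the next user act.

```python
import random

# P(next user act | conciseness band), from Table 3 (percentages).
USER_BIGRAM = {
    "concise": {"SysGoal": 20, "userElse": 60, "userQuit": 20},
    "average": {"SysGoal": 60, "userElse": 20, "userQuit": 20},
    "verbose": {"SysGoal": 20, "userElse": 20, "userQuit": 60},
}

def conciseness(num_attributes):
    """Discretise #attributes into the MATCH conciseness bands
    (thresholds here are an illustrative assumption)."""
    if num_attributes <= 2:
        return "concise"
    if num_attributes <= 4:
        return "average"
    return "verbose"

def sample_user_act(num_attributes):
    """Sample the next user act, given what has been generated so far."""
    dist = USER_BIGRAM[conciseness(num_attributes)]
    acts, weights = zip(*dist.items())
    return random.choices(acts, weights=weights)[0]

print(sample_user_act(3))   # e.g. 'SysGoal' with probability 0.6
```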
6.2   Realizer model

The sequential NLG model assumes a realizer, which updates the context after each generation step (i.e. after each single action). We estimate the realiser's parameters from the mean values we found in the MATCH data (see Table 1). For this study we first (randomly) vary the number of attributes, whereas the number of sentences is fixed (see Table 4). In current work we replace the realizer model with an implemented generator that replicates the variation found in the SPaRKy realizer (Stent et al., 2004).

                 #attr     #sentence
  SUMMARY        1 or 2        2
  COMPARE        3 or 4        6
  RECOMMEND      2 or 3        3

           Table 4: Realizer parameters
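The stochastic realizer model can be read directly off Table 4; a small sketch (illustrative only, not the SPaRKy-based generator mentioned above) samples the number of attributes per strategy and keeps the sentence count fixed:

```python
import random

# Per-strategy realizer parameters from Table 4:
# possible #attr values and a fixed #sentence count.
REALIZER = {
    "SUMMARY":   {"attrs": (1, 2), "sentences": 2},
    "COMPARE":   {"attrs": (3, 4), "sentences": 6},
    "RECOMMEND": {"attrs": (2, 3), "sentences": 3},
}

def realize(strategy):
    """Simulate one generation step: randomly pick how many attributes
    the realizer produces; the number of sentences is fixed."""
    params = REALIZER[strategy]
    return random.choice(params["attrs"]), params["sentences"]

attrs, sents = realize("COMPARE")
print(f"COMPARE realised with {attrs} attributes in {sents} sentences")
```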
                                                              tion states. We trained the policy using the well
                                                              known SARSA algorithm, using linear function ap-
6.3   Reward function
                                                              proximation (Sutton and Barto, 1998). The policy
The reward function defines the final goal of the             was trained for 3600 simulated NLG sequences.
utterance generation sequence. In this experiment                In future work we plan to learn lower level deci-
the reward is a function of the various data-driven           sions, such as lexical adaptation based on the vo-
trade-offs as identified in the data analysis in Sec-         cabulary used by the user.
tion 4: utterance length and number of provided
attributes, as weighted by the regression model               6.5   Baselines
in Equation 1, as well as the next predicted user
                                                              We derive the baseline policies from Informa-
action. Since we currently only have overhearer
                                                              tion Presentation strategies as deployed by cur-
data, we manually estimate the reward for the
                                                              rent dialogue systems. In total we utilise 7 differ-
next most likely user act, to supplement the data-
                                                              ent baselines (B1-B7), which correspond to single
driven model. If in the end state the next most
                                                              branches in our policy space (see Figure 1):
likely user act is userQuit, the learner gets a
penalty of −100, userElse receives 0 reward,                  B1: RECOMMEND only, e.g. (Young et al., 2007)
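Putting these pieces together, a hedged sketch of such a reward function might look as follows (the exact way the components are combined is not spelled out in the text, so treat the additive form as an assumption): the Equation 1 score rewards informative-but-concise utterances, and the predicted user act in the end state contributes +100, 0, or −100.

```python
def reward(num_attributes, num_sentences, predicted_user_act):
    """Reward at the end state of a generation sequence (illustrative sketch).

    Combines the data-driven trade-off from Equation 1 with the
    hand-coded scores for the predicted next user act.
    """
    # Equation 1: informativeness vs. utterance length.
    score = 0.775 * num_attributes - 0.301 * num_sentences
    # Hand-coded component for the predicted user reaction.
    user_component = {"SysGoal": 100.0, "userElse": 0.0, "userQuit": -100.0}
    return score + user_component[predicted_user_act]

# A long but informative sequence the user reacts well to:
print(reward(num_attributes=7, num_sentences=8, predicted_user_act="SysGoal"))
# The same content, but the user is predicted to quit:
print(reward(num_attributes=7, num_sentences=8, predicted_user_act="userQuit"))
```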
6.4   State and Action Space

We now formulate the problem as a Markov Decision Process (MDP), relating states to actions. Each state-action pair is associated with a transition probability: the probability of moving from state s at time t to state s' at time t+1 after having performed action a in state s. This transition probability is computed by the environment model (i.e. the user and realizer), and explicitly captures the noise/uncertainty in the environment. This is a major difference to other, non-statistical planning approaches. Each transition is also associated with a reinforcement signal (or reward) r_{t+1} describing how good the result of action a was when performed in state s.

The state space comprises 9 binary features representing the number of attributes, 2 binary features representing whether the predicted next user action is to follow the system goal or to quit, and a discrete feature reflecting the number of sentences generated so far, as shown in Figure 3. This results in 2^11 × 6 = 12,288 distinct generation states. We trained the policy using the well-known SARSA algorithm with linear function approximation (Sutton and Barto, 1998). The policy was trained for 3600 simulated NLG sequences.
                                                                  (Whittaker et al., 2002)
ample, the user is less likely to choose an item
if there are more than 7 attributes, but the real-            B5: Randomly choosing between COMPARE and
izer can generate 9 attributes. However, in some                  RECOMMEND , e.g. (Walker et al., 2007)

Figure 3: State-Action space for RL-NLG. Action: {SUMMARY, COMPARE, RECOMMEND, end}; state: {attribute_1..attribute_9: {0,1}, sentence: {1-11}, userGoal: {0,1}, userQuit: {0,1}}.

6.5   Baselines

We derive the baseline policies from Information Presentation strategies as deployed by current dialogue systems. In total we utilise 7 different baselines (B1-B7), which correspond to single branches in our policy space (see Figure 1):

B1: RECOMMEND only, e.g. (Young et al., 2007)

B2: COMPARE only, e.g. (Henderson et al., 2008)

B3: SUMMARY only, e.g. (Polifroni and Walker, 2008)

B4: SUMMARY followed by RECOMMEND, e.g. (Whittaker et al., 2002)

B5: Randomly choosing between COMPARE and RECOMMEND, e.g. (Walker et al., 2007)

B6: Randomly choosing between all 7 outputs

B7: Always generating the whole sequence, i.e. SUMMARY+COMPARE+RECOMMEND, as suggested by the analytic solution (see Section 4).

6.6   Results

We analyse the test runs (n = 200) using an ANOVA with a post-hoc T-test (with Bonferroni correction). RL significantly (p < .001) outperforms all baselines in terms of final reward, see Table 5. RL is the only policy which significantly improves the next most likely user action by adapting to features in the current context. In contrast to conventional approaches, RL learns to 'control' its environment according to the estimated transition probabilities and the associated rewards.
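As an illustration of this kind of analysis (not the authors' evaluation script; the reward samples are placeholders whose means are loosely based on Table 5 and whose noise is invented), a one-way ANOVA followed by Bonferroni-corrected pairwise t-tests can be run with SciPy as follows:

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Placeholder reward samples: one array of n=200 final rewards per policy.
rng = np.random.default_rng(0)
rewards = {name: rng.normal(mean, 140, size=200)
           for name, mean in [("B1", 99), ("B4", 176), ("B7", 229), ("RL", 311)]}

# One-way ANOVA over all policies.
f_stat, p_value = stats.f_oneway(*rewards.values())
print(f"ANOVA: F = {f_stat:.1f}, p = {p_value:.3g}")

# Post-hoc pairwise t-tests with Bonferroni correction.
pairs = list(combinations(rewards, 2))
alpha_corrected = 0.05 / len(pairs)
for a, b in pairs:
    t, p = stats.ttest_ind(rewards[a], rewards[b])
    verdict = "significant" if p < alpha_corrected else "n.s."
    print(f"{a} vs {b}: p = {p:.3g} ({verdict})")
```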
The learnt policy can be described as follows: it either starts with SUMMARY or COMPARE after the init state, i.e. it learnt never to start with a RECOMMEND. It stops generating after COMPARE if the userGoal is (probably) reached (e.g. the user is most likely to choose an item in the next turn, which depends on the number of attributes generated); otherwise it goes on and generates a RECOMMEND. If it starts with SUMMARY, it always generates a COMPARE afterwards. Again, it stops if the userGoal is (probably) reached; otherwise it generates the full sequence (which corresponds to the analytic solution B7).

The analytic solution B7 performs second best, and significantly outperforms all the other baselines (p < .01). Still, it is significantly worse (p < .001) than the learnt policy, as this 'one-shot' strategy cannot robustly and dynamically adapt to noise or changes in the environment.

In general, generating sequences of NLG actions rates higher than generating single actions only: B4 and B6 rate directly after RL and B7, while B1, B2, B3, and B5 are all equally bad given our data-driven definition of reward and environment. Furthermore, the simulated environment allows us to replicate the results in the MATCH corpus (see Section 4) when only comparing single strategies: SUMMARY performs significantly worse, while RECOMMEND and COMPARE perform equally well.

              policy    reward    (± std)
              B1        99.1      (± 129.6)
              B2        90.9      (± 142.2)
              B3        65.5      (± 137.3)
              B4        176.0     (± 154.1)
              B5        95.9      (± 144.9)
              B6        168.8     (± 165.3)
              B7        229.3     (± 157.1)
              RL        310.8     (± 136.1)

         Table 5: Evaluation Results (p < .001)

7   Conclusion

We presented and evaluated a new model for Natural Language Generation (NLG) in Spoken Dialogue Systems, based on statistical planning. After motivating and presenting the model, we studied its use in Information Presentation.

We derived a data-driven model predicting users' judgements of different information presentation actions (a reward function), via a regression analysis on MATCH data. We used this regression model to set weights in a reward function for Reinforcement Learning, and so optimized a context-adaptive presentation policy. The learnt policy was compared to several baselines derived from previous work in this area, and it significantly outperforms all the baselines.

We are currently collecting data in targeted Wizard-of-Oz experiments, to derive a fully data-driven training environment and test the learnt policy with real users, following (Rieser and Lemon, 2008b). The trained NLG strategy will also be integrated in an end-to-end statistical system within the CLASSiC project (www.classic-project.org).

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216594 (CLASSiC project: www.classic-project.org) and from the EPSRC, project no. EP/E019501/1.

References

A. Baddeley. 2001. Working memory and language: an overview. Journal of Communication Disorder, 36(3):189–208.

Vera Demberg and Johanna D. Moore. 2006. Information presentation in spoken dialogue systems. In Proceedings of EACL.

James Henderson, Oliver Lemon, and Kallirroi Georgila. 2008. Hybrid reinforcement / supervised learning of dialogue policies from fixed datasets. Computational Linguistics (to appear).

Srinivasan Janarthanam and Oliver Lemon. 2008. User simulations for online adaptation and knowledge-alignment in Troubleshooting dialogue systems. In Proc. of SEMdial.

Alexander Koller and Ronald Petrick. 2008. Experiences with planning for natural language generation. In ICAPS.

Alexander Koller and Matthew Stone. 2007. Sentence generation as planning. In Proceedings of ACL.

Oliver Lemon. 2008. Adaptive Natural Language Generation in Dialogue using Reinforcement Learning. In Proceedings of SEMdial.

Johanna Moore, Mary Ellen Foster, Oliver Lemon, and Michael White. 2004. Generating tailored, comparative descriptions in spoken dialogue. In Proc. FLAIRS.

Crystal Nakatsu and Michael White. 2006. Learning to say it well: Reranking realizations by predicted synthesis quality. In Proceedings of ACL.

Alice Oh and Alexander Rudnicky. 2002. Stochastic natural language generation for spoken dialog systems. Computer, Speech & Language, 16(3/4):387–407.

Tim Paek and Eric Horvitz. 2000. Conversation as action under uncertainty. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence.

Joseph Polifroni and Marilyn Walker. 2008. Intensional Summaries as Cooperative Responses in Dialogue Automation and Evaluation. In Proceedings of ACL.

Verena Rieser and Oliver Lemon. 2008a. Does this list contain what you were searching for? Learning adaptive dialogue strategies for Interactive Question Answering. J. Natural Language Engineering, 15(1):55–72.

Verena Rieser and Oliver Lemon. 2008b. Learning Effective Multimodal Dialogue Strategies from Wizard-of-Oz data: Bootstrapping and Evaluation. In Proceedings of ACL.

S. Singh, D. Litman, M. Kearns, and M. Walker. 2002. Optimizing dialogue management with Reinforcement Learning: Experiments with the NJFun system. JAIR, 16:105–133.

Amanda Stent, Marilyn Walker, Steve Whittaker, and Preetam Maloor. 2002. User-tailored generation for spoken dialogue: an experiment. In Proc. of ICSLP.

Amanda Stent, Rashmi Prasad, and Marilyn Walker. 2004. Trainable sentence planning for complex information presentation in spoken dialog systems. In Proceedings of ACL.

R. Sutton and A. Barto. 1998. Reinforcement Learning. MIT Press.

Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman. 2000. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3).

Marilyn Walker, S. Whittaker, A. Stent, P. Maloor, J. Moore, M. Johnston, and G. Vasireddy. 2004. User tailored generation in the MATCH multimodal dialogue system. Cognitive Science, 28:811–840.

Marilyn Walker, Amanda Stent, François Mairesse, and Rashmi Prasad. 2007. Individual and domain adaptation in sentence planning for dialogue. Journal of Artificial Intelligence Research (JAIR), 30:413–456.

Steve Whittaker, Marilyn Walker, and Johanna Moore. 2002. Fish or Fowl: A Wizard of Oz evaluation of dialogue strategies in the restaurant domain. In Proc. of the International Conference on Language Resources and Evaluation (LREC).

Stephen Whittaker, Marilyn Walker, and Preetam Maloor. 2003. Should I tell all? An experiment on conciseness in spoken dialogue. In Proc. European Conference on Speech Processing (EUROSPEECH).

Andi Winterboer, Jiang Hu, Johanna D. Moore, and Clifford Nass. 2007. The influence of user tailoring and cognitive load on user performance in spoken dialogue systems. In Proc. of the 10th International Conference of Spoken Language Processing (Interspeech/ICSLP).

SJ Young, J Schatzmann, K Weilhammer, and H Ye. 2007. The Hidden Information State Approach to Dialog Management. In ICASSP 2007.
