On Hallucination and Predictive Uncertainty in Conditional Language Generation
Yijun Xiao, William Yang Wang
University of California, Santa Barbara
{yijunxiao,william}@cs.ucsb.edu

Abstract

Despite improvements in performance on different natural language generation tasks, deep neural models are prone to hallucinating facts that are incorrect or nonexistent. Different hypotheses have been proposed and examined separately for different tasks, but no systematic explanation is available across these tasks. In this study, we draw connections between hallucinations and predictive uncertainty in conditional language generation. We investigate their relationship in both image captioning and data-to-text generation and propose a simple extension to beam search to reduce hallucination. Our analysis shows that higher predictive uncertainty corresponds to a higher chance of hallucination. Epistemic uncertainty is more indicative of hallucination than aleatoric or total uncertainty. It also helps to achieve better results when trading performance on standard metrics for less hallucination with the proposed beam search variant.

1 Introduction

Modern deep neural network models have brought drastic improvements in generation quality as measured by standard metrics on different natural language generation (NLG) tasks. However, along with these improvements, researchers find that neural models are prone to a phenomenon called hallucination, where models generate description tokens that are not supported by the source inputs. This phenomenon seriously damages the applicability of neural language generation models in practice, where information accuracy is vital.

Hallucination has been observed in various conditional NLG tasks such as image captioning (Rohrbach et al., 2018), data-to-text generation (Wiseman et al., 2017; Nie et al., 2019; Parikh et al., 2020), abstractive summarization (Cao et al., 2018; Durmus et al., 2020), and neural machine translation (NMT) (Müller et al., 2019). These studies tackle hallucinations within a specific task and give possible explanations of why hallucinations occur. For example, Rohrbach et al. (2018) attributes object hallucination in image captioning to visual misclassification and over-reliance on language priors; Nie et al. (2019) believes hallucination in neural surface realization comes from the misalignment between meaning representations and their corresponding references in the dataset; Müller et al. (2019) claims that hallucinations in NMT are mainly due to domain shift.

We believe that there is a common theme across all the hallucination explanations in conditional NLG tasks: predictive uncertainty. In language generation, predictive uncertainty quantifies the entropy of the token probability distributions a model predicts. There are multiple sources of uncertainty. Two major ones frequently studied are aleatoric and epistemic uncertainty, where the former comes from the data or measurements and the latter is concerned with the model. With recent progress in Bayesian neural networks (BNNs) (Hinton and Van Camp, 1993; Neal, 1995) and uncertainty quantification (Blundell et al., 2015; Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017), we are able to quantify both parts of predictive uncertainty in neural NLG.

This study draws connections between hallucination and predictive uncertainty and empirically investigates their relationship in image captioning and data-to-text generation tasks. We propose an uncertainty-aware beam search algorithm to reduce the chance of hallucination by penalizing parts or the entirety of the predictive uncertainty during model decoding. We find that the choice of uncertainty matters, and penalizing epistemic uncertainty yields better results compared to penalizing aleatoric or total uncertainty. Our contributions are:
• We draw connections between hallucination and predictive uncertainty across various conditional natural language generation tasks and empirically investigate their relationship.

• We propose an uncertainty-aware beam search approach for hallucination reduction to demonstrate that lowering uncertainty can lead to less hallucination.

• We show that uncertainty decomposition helps to achieve better trade-offs between hallucination and performance.

2 Hallucination and Predictive Uncertainty

2.1 Hallucination Probability

In general, hallucination refers to the phenomenon where the model generates false information not supported by the input. For example, in the context of image captioning, hallucination can be defined as generating captions that contain descriptions not present in the given image. Let (x, y) be the pair of variables of interest, where x is some structured data containing facts and y is a natural language sentence based on the facts. The task is to learn the conditional distribution p(y|x) in order to generate a sentence y given any new input x. Most neural approaches break the probability into a sequence of single-token predictions:

p(y \mid x) = p(y_1 \mid x) \prod_{i=2}^{k} p(y_i \mid x, y_1, \cdots, y_{i-1})    (1)

where {y_1, ..., y_k} is the collection of tokens in sentence y. We denote c_i = {x, y_1, ..., y_{i-1}} as the context of the i-th prediction in the following sections for simplicity.

Hallucination is clearly context-dependent, which means we need to look at a certain context c_i and determine whether the next token prediction y_i is hallucinated or not. Let V_h^{(c_i)} denote the set of tokens that are considered false information given the current context c_i, and V the whole vocabulary. Consider a random sampling decoder where a token is generated based on the predicted categorical distribution, i.e., Cat(|V|, p(y_i \mid c_i)). The probability of hallucination at the current step is simply:

P(y_i \in V_h^{(c_i)}) = \sum_{v \in V_h^{(c_i)}} p(y_i = v \mid c_i)    (2)

Practically, it is hard to automatically determine the context-dependent set V_h^{(c_i)}. Task-specific heuristics are often used to determine which tokens are hallucinated. In specific restrictive applications, the context-dependent set can be relaxed to a context-independent one to reduce the complexity of determining hallucination.

2.2 Relationship with Predictive Uncertainty

We use entropy to measure the predictive uncertainty in this work. The total uncertainty of predicting token y_i is:

H(y_i \mid c_i) = -\sum_{v \in V} p(y_i = v \mid c_i) \log p(y_i = v \mid c_i)
               = -\sum_{v \in V \setminus V_h^{(c_i)}} p(y_i = v \mid c_i) \log p(y_i = v \mid c_i) - \sum_{v \in V_h^{(c_i)}} p(y_i = v \mid c_i) \log p(y_i = v \mid c_i)    (3)

From Equation 3, we can see that there are two sources of uncertainty for the token predictions: one from the uncertainty of choosing suitable tokens to describe the input; the other from unsuitable tokens attaining considerable probability mass, either by being confusable in the current context or due to an insufficiently trained system.

The second source of uncertainty is directly related to the hallucination probability. Although no monotonic relationship can be derived, a near-zero hallucination probability requires a near-zero value of the second source of uncertainty. This observation prompts us to investigate the relationship between hallucination and predictive uncertainty in practice. Intuitively, the higher the predictive uncertainty, the more likely it is that some of the probability mass gets assigned to unsuitable tokens.

2.3 Uncertainty Decomposition

Two types of uncertainty are frequently mentioned in the uncertainty quantification literature: epistemic and aleatoric uncertainty (Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017; Depeweg et al., 2018). Epistemic uncertainty reflects the uncertainty on model weights, and aleatoric uncertainty concerns inherent uncertainty in the data or measurement. We are interested in whether the relationship with hallucination is the same for both types of uncertainty.
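Before turning to the decomposition, the quantities in Equations 2 and 3 can be made concrete with a minimal sketch (our own Python/NumPy illustration, not code from the paper). The toy vocabulary, the distribution, and the hypothetical hallucination set V_h^{(c_i)} are made-up values purely for illustration.

import numpy as np

# Toy next-token distribution p(y_i = v | c_i) over a 5-word vocabulary and a
# hypothetical context-dependent hallucination set (illustrative values only).
vocab = ["cat", "dog", "table", "car", "tree"]
p = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
hallucinated = {"table", "car"}  # tokens unsupported by the input in this context

h_mask = np.array([tok in hallucinated for tok in vocab])

# Equation 2: chance that a random-sampling decoder emits a hallucinated token.
p_hallucinate = p[h_mask].sum()

# Equation 3: total entropy split into a "suitable token" part and an
# "unsuitable token" part; only the second part relates to hallucination.
entropy_terms = -p * np.log(p)
h_total = entropy_terms.sum()
h_suitable = entropy_terms[~h_mask].sum()
h_unsuitable = entropy_terms[h_mask].sum()

print(f"P(hallucination)  = {p_hallucinate:.3f}")
print(f"total entropy     = {h_total:.3f}")
print(f"  suitable part   = {h_suitable:.3f}")
print(f"  unsuitable part = {h_unsuitable:.3f}")

As the sketch shows, a small hallucination probability forces the "unsuitable" entropy term toward zero, which is the observation motivating the empirical analysis below.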
Bayesian deep learning approaches (Blundell et al., 2015; Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017) are widely studied for uncertainty quantification with neural networks. Following the notation in Section 2.2, the predictive distribution p(y_i | c_i) can be written as:

p(y_i \mid c_i) = \int_w p(y_i \mid c_i, w) q(w) \, dw    (4)

where w parameterizes the neural network that makes predictions and q(w) denotes the approximate posterior distribution of the weights w given the training data. Notice that if we fix the weights w, H(y_i \mid c_i, w) represents the entropy that is unrelated to the uncertainty of the model weights. Therefore the aleatoric part of the predictive uncertainty can be calculated with E_{q(w)}[H(y_i \mid c_i, w)]. The epistemic part of the uncertainty is the difference between the total and the aleatoric uncertainty, as shown below:

u_{al}(y_i \mid c_i) = E_{q(w)}[H(y_i \mid c_i, w)]    (5)
u_{ep}(y_i \mid c_i) = H(y_i \mid c_i) - E_{q(w)}[H(y_i \mid c_i, w)]    (6)

In this study, the aleatoric and epistemic parts of predictive uncertainty are estimated using deep ensembles (Lakshminarayanan et al., 2017). More concretely, denote the model predictions as {p_m(y_i \mid c_i)}_{m=1}^{M} and the aggregated prediction as p(y_i = v \mid c_i) = \frac{1}{M} \sum_{m=1}^{M} p_m(y_i = v \mid c_i). The aleatoric and epistemic uncertainties are then calculated as:

u_{al}(y_i \mid c_i) = \frac{1}{M} \sum_{m=1}^{M} H_m(y_i \mid c_i)    (7)
u_{ep}(y_i \mid c_i) = H(y_i \mid c_i) - u_{al}(y_i \mid c_i)    (8)

where H_m(y_i \mid c_i) and H(y_i \mid c_i) are the entropy of p_m(y_i \mid c_i) and p(y_i \mid c_i), respectively.

Intuitively, in the case of deep ensembles, aleatoric uncertainty measures the average spread of all model predictions, while epistemic uncertainty measures the agreement among all model predictions. Examples with three possible tokens are illustrated in Figure 1.

Figure 1: Examples of predictions with (a) high aleatoric but low epistemic uncertainty; and (b) high epistemic but low aleatoric uncertainty.

(a)            Token 1   Token 2   Token 3
Prediction 1     0.33      0.33      0.33
Prediction 2     0.33      0.33      0.33
Prediction 3     0.33      0.33      0.33

(b)            Token 1   Token 2   Token 3
Prediction 1     0.98      0.01      0.01
Prediction 2     0.01      0.98      0.01
Prediction 3     0.01      0.01      0.98
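A minimal sketch of the deep-ensemble estimates in Equations 7 and 8 follows (our own illustration, not the released code): aleatoric uncertainty is the mean member entropy, and epistemic uncertainty is the entropy of the averaged distribution minus that mean. The two toy ensembles reproduce the situations in Figure 1.

import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def decompose(member_probs):
    """Split total predictive entropy into aleatoric and epistemic parts for one
    prediction step with a deep ensemble (Eqs. 5-8).

    member_probs: array of shape (M, |V|), one distribution per ensemble member.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_probs = member_probs.mean(axis=0)                    # aggregated p(y_i | c_i)
    total = entropy(mean_probs)                               # H(y_i | c_i)
    aleatoric = np.mean([entropy(p) for p in member_probs])   # Eq. 7
    epistemic = total - aleatoric                             # Eq. 8
    return total, aleatoric, epistemic

# Figure 1(a): members agree on a flat distribution
# -> high aleatoric, near-zero epistemic uncertainty.
flat = np.full((3, 3), 1.0 / 3.0)
print(decompose(flat))

# Figure 1(b): each member is confident but they disagree
# -> low aleatoric, high epistemic uncertainty.
peaked = np.array([[0.98, 0.01, 0.01],
                   [0.01, 0.98, 0.01],
                   [0.01, 0.01, 0.98]])
print(decompose(peaked))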
3 Case Study: Image Captioning

In this section, we analyze image captioning models trained on the MSCOCO (Chen et al., 2015) dataset.

3.1 Hallucination Probability at Different Uncertainty Levels

The first question we want to investigate is whether hallucination probabilities change at different predictive uncertainty levels. The experimental settings are listed below.

Model architecture: We consider four different image captioning models: the FC model (Rennie et al., 2017), where image features are used to initialize the RNN decoder; the Att2In model from Rennie et al. (2017), which applies attention on image features and feeds it into the decoder LSTM (Hochreiter and Schmidhuber, 1997) cell gate; the BUTD model from Anderson et al. (2018), which uses bottom-up attention operating at the level of objects and other salient image regions; and a Transformer model, where transformers (Vaswani et al., 2017) are used in the encoder-decoder structure for generation. All models are implemented in the open source framework by Luo et al. (2018) (https://github.com/ruotianluo/self-critical.pytorch).

Training: We use the same data split as Karpathy and Fei-Fei (2015). All models are trained with batch size 50 for 30 epochs with the Adam optimizer (Kingma and Ba, 2014). Evaluations are done on the Karpathy test set.

Hallucination and uncertainty evaluation: As in Rohrbach et al. (2018), synonyms for all possible MSCOCO objects are used to determine whether an object generated by the captioning model is hallucinated. Hallucination probabilities are calculated by binning all object token prediction entropies and counting the percentage of hallucinated objects in each bin.

3.2 Results and Discussions

Figure 2 shows the object hallucination percentages at different predictive uncertainty levels. At higher uncertainty levels, the generated objects are more likely to be hallucinated. The results are consistent across the four different models. The Transformer model seems to have a higher hallucination chance at high uncertainty levels than the other three models. However, this does not indicate that Transformer models hallucinate more. In fact, the Transformer model has the lowest overall hallucination percentage among all four models.

Figure 2: Object hallucination chance at different predictive uncertainty levels for the FC, Att2In, BUTD, and Transformer models (hallucination % plotted against predictive uncertainty; plot omitted). Higher predictive uncertainty corresponds to a higher level of hallucination percentage across all models.

Beyond object hallucination: Aside from object hallucination, we also analyze verbs generated by the models to see whether a similar relationship holds for other types of token generations. The same models and training procedures are adopted. We extract all present continuous tense verbs from the generated captions using the spaCy part-of-speech tagger (https://spacy.io) and manually label whether they are suitable to describe the corresponding images. There are approximately 3500 generated captions containing verbs, and 400 are annotated for each model. We refer to unsuitable verbs generated in the captions as action hallucinations.

Action predictions are binned according to their uncertainty values, and the results are shown in Table 1. We can observe that action tokens with higher predictive uncertainty are also more likely to be hallucinated. Noticeably, the Transformer model also has a higher action hallucination rate at high uncertainty levels.

Table 1: Action hallucination percentages at different levels of predictive uncertainty. Action predictions with higher uncertainty are more prone to hallucination.

Model         ≤ 0.8   0.8-1.6   1.6-2.4   2.4-3.2   3.2-4.0   > 4.0
FC             0.00      0.00      2.27     12.86     15.71   31.03
Att2In         0.00      0.00      3.39      6.58     12.07   22.03
BUTD           0.00      2.94      1.92     12.77     17.24   25.53
Transformer    2.99      5.48      6.58      8.82     12.00   43.75

Examples of predictions with high and low uncertainty: Figure 3 shows some example images and their captions generated from a BUTD model on the test set. The token predictions of interest and the corresponding uncertainty values are highlighted in bold and italic, respectively. We observe that highly uncertain predictions often correspond to unusual textures, features resembling the predicted tokens, or blurred images. For example, Figure 3(b) shows a motorcycle covered in vines; Figure 3(d) shows candles in the background which resemble cakes; Figure 3(f) is blurred.

Figure 3: Examples of token predictions generated with the BUTD model with high and low uncertainty values for objects (top) and actions (bottom). Numbers in italic are predictive uncertainty values for the token predictions preceding them. The examples are cherry-picked. (Images omitted; panel captions:)
(a) a red and black motorcycle (0.58) parked in a parking lot
(b) a motorcycle (4.80) is parked on a dock with a bird perched on top of it
(c) a bride and groom cutting their wedding cake (0.09)
(d) a woman holding a cup and a cake (5.29)
(e) a man standing on a tennis court holding (0.81) a racquet
(f) a young man is holding (4.76) a skateboard in his hand
(g) a group of children sitting at a table eating (1.00) pizza
(h) a man is eating (4.01) a hot dog at a restaurant

Epistemic and aleatoric uncertainties: As we can decompose the total uncertainty into two parts, we are interested in which part is more indicative of hallucination. Table 2 shows the Pearson correlation coefficients between hallucination (binary) and epistemic/aleatoric uncertainty for all four models. We can see that both parts of uncertainty are weakly correlated with hallucination, while epistemic uncertainty is more indicative of hallucination across all four models compared to aleatoric uncertainty.

Table 2: Pearson correlation coefficients between hallucination and epistemic/aleatoric uncertainty in the image captioning task. Epistemic uncertainty is more indicative of hallucination across the four models.

Model         epistemic   aleatoric
FC              0.313       0.299
BUTD            0.334       0.228
Att2In          0.360       0.268
Transformer     0.269       0.131
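The analyses in this section can be reproduced with a short script of the following shape (a sketch under our own assumptions, not the paper's code; the per-token uncertainty values and binary hallucination labels are assumed to have been collected already, e.g. via the MSCOCO synonym lists and the manual action annotations described above).

import numpy as np
from scipy.stats import pearsonr

def hallucination_rate_by_bin(uncertainties, is_hallucinated, bin_edges):
    """Percentage of hallucinated tokens in each predictive-uncertainty bin
    (the quantity plotted in Figure 2 and reported in Table 1)."""
    uncertainties = np.asarray(uncertainties, dtype=float)
    is_hallucinated = np.asarray(is_hallucinated, dtype=bool)
    rates = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (uncertainties >= lo) & (uncertainties < hi)
        rates.append(100.0 * is_hallucinated[mask].mean() if mask.any() else float("nan"))
    return rates

# Hypothetical collected data: one uncertainty value and one binary hallucination
# label per generated object/action token (placeholder values).
unc = np.array([0.2, 0.9, 1.7, 2.6, 3.5, 4.4, 0.5, 2.9])
hal = np.array([0, 0, 0, 1, 0, 1, 0, 1])

print(hallucination_rate_by_bin(unc, hal, bin_edges=[0.0, 0.8, 1.6, 2.4, 3.2, 4.0, np.inf]))

# Pearson correlation between the binary hallucination label and an uncertainty
# component (epistemic or aleatoric), as in Table 2.
r, _ = pearsonr(hal, unc)
print(f"Pearson r = {r:.3f}")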
4 Case Study: Data-to-text Generation

Data-to-text generation (Kukich, 1983; McKeown, 1992) is the task of generating textual content conditioned on input content in the form of structured data such as tables. Neural models are prone to hallucination in data-to-text generation tasks compared to traditional template-based systems, and methods have been proposed to improve faithfulness (Wiseman et al., 2017; Nie et al., 2019; Tian et al., 2019). In this section, we discuss the relationship between predictive uncertainty and hallucination in data-to-text generation with the ToTTo dataset (Parikh et al., 2020).

4.1 Generation Quality and Average Uncertainty

We conducted token-level analysis in Section 3. Now we take a different route and analyze sentence-level quality at different average predictive uncertainty values. The experiment settings are described below.

Dataset: The ToTTo dataset consists of tables from English Wikipedia articles with their corresponding metadata, such as page title and section title. Candidate description texts are modified by annotators to pair with each table. Relevant table cells supporting the description texts are highlighted by the annotators as well. There are 120,761 table-text pairs in training, 7,700 in validation, and 7,700 in test. We use the baseline standard linearization approach to represent the highlighted portions of the tables along with their corresponding metadata (referred to as subtable with metadata in (Parikh et al., 2020)).

Model architecture and training: We use a standard sequence-to-sequence model with attention (Bahdanau et al., 2015; Luo et al., 2018) for analysis. An LSTM with hidden size 512 is used for both the encoder and the decoder. The Adam optimizer with learning rate 1e-3 is used for the optimization. The model is trained with cross-entropy loss for 20 epochs. The checkpoint with the best validation loss is chosen for the evaluation. The implementation is done using fairseq (Ott et al., 2019) (https://github.com/pytorch/fairseq).

Evaluation: We evaluate the average predictive uncertainty for all generated sentences in the validation set and select the top, bottom, and middle 5% for comparison. BLEU score (Papineni et al., 2002) is used as an automatic metric to evaluate the similarity to the references; further manual annotations are done to evaluate the fluency, faithfulness (precision), and coverage with respect to the reference (recall) of the generated sentences. In particular, faithfulness reflects how likely the generated sentences are to hallucinate facts that are not supported by the tables. More details of the human evaluation metrics are described in (Parikh et al., 2020). The goal is to measure how different the generation qualities are for candidates with varying average predictive uncertainties.
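For this sentence-level analysis, each generated sentence is scored by its average per-token predictive uncertainty. A sketch of the selection step is shown below (our own illustration with hypothetical helper names, not the paper's code; the per-token entropies are assumed to have been recorded during decoding).

import numpy as np

def average_uncertainty(token_entropies):
    """Sentence-level predictive uncertainty = mean per-token entropy."""
    return float(np.mean(token_entropies))

def split_by_uncertainty(sentences, per_token_entropies, frac=0.05):
    """Return the (low, medium, high) slices of generated sentences ranked by
    average predictive uncertainty, keeping `frac` of the data in each slice."""
    scores = np.array([average_uncertainty(e) for e in per_token_entropies])
    order = np.argsort(scores)                 # ascending: low uncertainty first
    k = max(1, int(frac * len(sentences)))
    mid_start = (len(sentences) - k) // 2
    pick = lambda idx: [sentences[i] for i in idx]
    low = pick(order[:k])
    medium = pick(order[mid_start:mid_start + k])
    high = pick(order[-k:])
    return low, medium, high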
4.2 Results and Discussions

Table 3 summarizes the evaluation results for candidates with varying uncertainty values. It is obvious that candidates with higher average predictive uncertainty values are less fluent and more likely to contain hallucinations. Another interesting observation from Table 3 is that the generated sentences with medium average uncertainty are more likely (16.9%) to cover more table facts than the references compared to the ones with high (4.7%) and low (7.7%) average uncertainty. One possible explanation is that some table facts that are not always included in the references, when generated, have higher predictive uncertainty values than the facts that are almost always included in the references. Therefore, generated sentences with low uncertainty tend to include fewer but more confident facts considered by the model.

Table 3: Evaluation results for candidates with high, medium, and low average predictive uncertainty values on the ToTTo validation set. Unc. denotes uncertainty. Higher-uncertainty candidates have lower quality and a higher chance of being hallucinated/unfaithful w.r.t. the input tables.

Unc. Level   Avg Unc.      BLEU   Fluency (%)   Faithfulness (%)   Less/Neutral/More Coverage w.r.t. Ref
High         1.83 - 3.74   10.2   46.0          41.3               79.4 / 15.9 / 4.7
Medium       0.83 - 0.89   31.5   87.3          78.9               35.2 / 47.9 / 16.9
Low          0.04 - 0.27   72.8   100.0         99.0               22.2 / 70.1 / 7.7

5 Reducing Hallucination

5.1 Uncertainty-Aware Beam Search

Because of the positive correlation between hallucination probability and predictive uncertainty, it is straightforward to incorporate uncertainty into the caption generation process to reduce hallucination. Beam search is the most used approximate decoding method in language generation. It keeps track of the top-B scored candidates at each generation step and considers all single-token extensions of the current candidates.

More formally, denote the set of B candidates in the beam at time step t-1 as Y_{t-1} = \{ y_{t-1}^{(b)} \}_{b=1}^{B}. All possible single-token extensions of the candidates in Y_{t-1} form a set C_t = \{ y \mid y_{t-1} \in Y_{t-1} \wedge y_t \in V \}. The beam at step t is then formed as:

Y_t = \arg\max_{y^1, \cdots, y^B \in C_t} \sum_{b=1}^{B} \log p(y^b \mid x), \quad \text{s.t. } y^i \neq y^j \; \forall i \neq j    (9)

Uncertainty-aware beam search (UABS) adds a weighted penalty term to the beam search objective to balance between the log probability and the predictive uncertainty of the selected candidates. Let u(y|x) be the function that measures the aggregated predictive uncertainty of candidate y given input x. Uncertainty-aware beam search updates the beam at step t according to the following equation:

Y_t = \arg\max_{y^1, \cdots, y^B \in C_t} \sum_{b=1}^{B} \left[ \log p(y^b \mid x) - \lambda u(y^b \mid x) \right], \quad \text{s.t. } y^i \neq y^j \; \forall i \neq j    (10)

where λ ≥ 0 is the weight controlling the degree to which we want to penalize decoding uncertainty. A larger λ leads to candidates with smaller predictive uncertainty. In practice, this can be done by subtracting the weighted uncertainty term from the aggregated log probability scores at each decoding step before choosing the top-B candidates.

An important decision in using uncertainty-aware beam search is the choice of the uncertainty term u(y|x). We could use either the aleatoric or epistemic part of the predictive uncertainty, or both. We compare these choices and discuss the results in the next section.
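A simplified sketch of one UABS decoding step is given below (our own reading of Equation 10, not the authors' released implementation). It assumes an ensemble whose members each return next-token log-probabilities for a prefix; the helper ensemble_next_token_log_probs is hypothetical and stands in for a batch of model forward passes, and candidate scores are assumed to use the log of the ensemble-averaged probabilities.

import numpy as np

def uabs_step(beam, ensemble_next_token_log_probs, beam_size, lam):
    """One uncertainty-aware beam search step (sketch of Eq. 10).

    beam: list of (prefix_token_list, cum_log_prob, cum_uncertainty)
    ensemble_next_token_log_probs(prefix) -> array (M, |V|) of per-member
        next-token log-probabilities (hypothetical helper).
    Returns the top-`beam_size` extended candidates ranked by cumulative
    log-probability minus lam times cumulative epistemic uncertainty.
    """
    candidates = []
    for prefix, cum_lp, cum_unc in beam:
        member_lp = ensemble_next_token_log_probs(prefix)          # (M, |V|)
        member_p = np.exp(member_lp)
        mean_p = member_p.mean(axis=0)                             # aggregated p(y_t | c_t)
        total_H = -(mean_p * np.log(mean_p + 1e-12)).sum()         # entropy of the mean
        aleatoric_H = (-(member_p * member_lp).sum(axis=1)).mean() # mean member entropy
        epistemic_H = max(total_H - aleatoric_H, 0.0)              # step-level uncertainty
        log_mean_p = np.log(mean_p + 1e-12)
        for v in range(mean_p.shape[0]):
            candidates.append((
                prefix + [v],
                cum_lp + log_mean_p[v],
                cum_unc + epistemic_H,
            ))
    # Score = aggregated log-probability minus the weighted aggregated uncertainty.
    candidates.sort(key=lambda c: c[1] - lam * c[2], reverse=True)
    return candidates[:beam_size]

Starting from beam = [([bos_id], 0.0, 0.0)] and calling uabs_step repeatedly implements the penalized objective; setting lam = 0 recovers standard beam search over the ensemble-averaged distribution, and swapping epistemic_H for aleatoric_H or total_H gives the other penalty variants compared below.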
5.2 Image Captioning Results

With larger weights on the uncertainty penalty term, the log probabilities of the decoded sentences drop. Therefore, we expect to see a trade-off between the quality of the generated captions and the chance of hallucination.

We empirically examine the trade-offs on the image captioning models with different uncertainty choices for the penalty term. We use a five-model ensemble for each of the four model architectures to estimate aleatoric and epistemic uncertainties. Due to the different magnitudes of aleatoric and epistemic uncertainties, we choose the penalty weight λ from [0.1, 0.2, 0.4, 0.8, 1.0, 2.0, 4.0] for aleatoric and total uncertainty and from [10, 20, 40, 80] for epistemic uncertainty.

Figure 4 shows the trade-offs between CIDEr (Vedantam et al., 2015) and CHAIRi (Rohrbach et al., 2018) scores of captions generated with uncertainty-aware beam search with different uncertainty choices and penalty weights. A smaller value of CHAIRi indicates the model is less likely to generate hallucinated objects, and a higher CIDEr indicates better caption quality. Therefore, an approach that is to the upper left of another is better. As the penalty weight increases, we observe a decrease in both the CHAIRi and the CIDEr scores across all models.

Figure 4: CIDEr plotted against CHAIRi scores of captions generated with UABS with different uncertainty penalty weights (epistemic, aleatoric, and total) for (a) FC, (b) Att2In, (c) BUTD, and (d) Transformer (plots omitted). A lower CHAIRi score indicates less hallucination. Upper-left is better. Penalizing epistemic uncertainty in UABS achieves the best results.

Table 4 shows two examples of different generated captions using epistemic UABS with varying penalty weights. In the first example, we can see that a medium penalty weight of 20 not only helps avoid the hallucination of a table but also adds correct information about the color of the flowers. In the second example, a medium penalty weight is unable to change the generated caption.

Table 4: Two examples of epistemic UABS results with varying penalty weights on the image captioning dataset (images omitted). In the first example the model successfully avoids hallucination of a table with λ = 20, while in the second example it is unable to change the generated caption until a larger penalty weight is set.

Example 1
  λ = 0:  a vase filled with flowers sitting on top of a table
  λ = 20: a vase filled with lots of white flowers
  λ = 80: there is a vase that has flowers in it
Example 2
  λ = 0:  a wooden cutting board topped with lots of food
  λ = 20: a wooden cutting board topped with lots of food
  λ = 80: a cutting board that has a bunch on it

Table 5: Average sentence length and total number of objects detected in the captions generated by the BUTD model with varying uncertainty penalty weight λ. Penalizing epistemic uncertainty leads to slightly shorter lengths. The number of objects mentioned by the captions decreases with increasing λ. gen. % denotes the percentage of generic responses; it is moderate with epistemic-penalized results but can be very high if aleatoric uncertainty is heavily penalized.

           λ     avg. len.   # obj.   hal. %   gen. %
ref.       -       10.44      6114     0         -
base       0        9.31      7328     5.5       0
epist.    10        9.21      7195     5.2       0
          20        9.16      7078     4.9       0.2
          40        9.15      6912     4.2       1.5
          80        9.12      6493     3.6       4.6
aleat.   0.1        9.32      7250     5.4       0
         0.4        9.32      7051     5.1       0
         1.0        9.33      6800     4.7       1.0
         4.0        9.43      4349     4.1      28.4

Regarding the choice of uncertainty, it is notable that when penalizing epistemic uncertainty, the generated captions achieve higher CIDEr scores than when penalizing aleatoric or total uncertainty.
We hypothesize that this is because epistemic uncertainty indicates the uncertainty of the model weights. By penalizing epistemic uncertainty, we encourage the model to take the prediction path where it is well-calibrated. On the other hand, penalizing aleatoric uncertainty encourages the model to make low-entropy predictions in all contexts regardless of the actual data distributions.

Table 5 shows the average sentence length, the number of objects, the percentage of hallucinations, and the percentage of generic responses in the captions generated by the BUTD model with different uncertainty choices and penalty weights on the test set. We can see that when penalizing epistemic uncertainty, UABS results in slightly shorter caption candidates. Both the number of objects and the hallucination percentage decrease as we increase the weight λ. Interestingly, when penalizing aleatoric uncertainty, sentence length stays approximately the same despite lower CIDEr scores, as shown in Figure 4. Further investigation shows that this is partly due to an increasing number of generic captions such as "there is no image here to provide a caption for". Penalizing epistemic uncertainty is much less likely to result in such generic captions. We can see that when increasing λ from 1.0 to 4.0 with aleatoric UABS, the percentage of generic responses jumps drastically from 1.0% to 28.4%. In comparison, epistemic UABS keeps the generic response rates low while achieving lower hallucination rates.

5.3 Data-to-text Results

We also evaluate the effect of UABS on the ToTTo dataset. We choose to penalize epistemic uncertainty due to its better performance compared with aleatoric uncertainty, as shown in the previous section. A five-model deep ensemble is used to quantify the epistemic uncertainty and generate results with UABS. We compare the BLEU score and three human evaluation metrics among results generated with different uncertainty penalty weights. 100 generation results are randomly selected and evaluated for each penalty weight choice. The results are shown in Table 6. We can see that a relatively small penalty weight leads to a reduced hallucination chance (hence more faithful results) at a cost in BLEU score and fluency.

Table 6: Evaluation results for candidates decoded with different penalty weights for UABS on the ToTTo validation set. Epistemic uncertainty is used for uncertainty penalization. Faithfulness first increases, then decreases to the same level as regular beam search results as we increase the penalty weight λ.

λ     BLEU   Fluency (%)   Faithfulness (%)   Less/Neutral/More Coverage w.r.t. Ref
0     40.1   92            79                 34 / 60 / 6
10    33.6   83            84                 41 / 51 / 8
20    27.4   73            80                 52 / 42 / 6

To qualitatively examine the sentences generated with different λ values, we show example results on the ToTTo validation set in Table 7. We can see that with larger penalty weights, the UABS results drop certain statements that the model deems less confident, regardless of their correctness. This results in shorter but more confident predictions for UABS results with a larger uncertainty penalty.

Table 7: Two examples of UABS results with varying penalty weights on the ToTTo validation set. Blue tokens are correct table facts that are dropped by candidates generated with larger penalty weights; red tokens are incorrect/hallucinated facts that are dropped with larger penalty weights. In general, UABS with larger weights tends to produce sentences with less information that the model is more confident about.

Example 1
  Reference: barrows scored 164 net points in virgin islands at the 2008 summer olympics.
  λ = 0:  in virgin islands at the 2008 summer olympics, barrows iii received 164 points.
  λ = 10: in virgin islands at the 2008 summer olympics, barrows received 164 points.
  λ = 20: thomas barrows received a total score of 164.
Example 2
  Reference: janet gaynor won the first academy award for best actress for her performance in the 7th heaven (1927 film).
  λ = 0:  janet gaynor won the academy award for best actress for his performance in janet gaynor.
  λ = 10: janet gaynor won the academy award for best actress.
  λ = 20: janet gaynor won an academy award for best actress.
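In practice, the penalty weight λ is chosen by sweeping a small grid and inspecting the resulting quality/hallucination trade-off, as in Figure 4 and Table 6. A sketch of that loop is below; the helpers decode_with_uabs, corpus_quality, and hallucination_rate are hypothetical placeholders standing in for, e.g., UABS decoding, CIDEr/BLEU scoring, and CHAIR or manual faithfulness checks.

def sweep_lambda(inputs, references, decode_with_uabs, corpus_quality,
                 hallucination_rate, lambdas=(0, 10, 20, 40, 80)):
    """Decode with each penalty weight and collect the trade-off curve.

    decode_with_uabs(inputs, lam) -> list of generated sentences (hypothetical)
    corpus_quality(outputs, references) -> float, e.g. BLEU or CIDEr (hypothetical)
    hallucination_rate(outputs, inputs) -> float in [0, 1] (hypothetical)
    """
    curve = []
    for lam in lambdas:
        outputs = decode_with_uabs(inputs, lam)
        curve.append({
            "lambda": lam,
            "quality": corpus_quality(outputs, references),
            "hallucination": hallucination_rate(outputs, inputs),
        })
    return curve  # then pick the largest lambda whose quality drop is acceptable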
6 Related Work

Hallucination: There are many pieces of anecdotal evidence of hallucination presented in various NLG tasks. Most recently, researchers started investigating the phenomenon systematically. Rohrbach et al. (2018) analyzes object hallucination focusing on the objects that appeared in the MSCOCO segmentation challenge. They propose the CHAIR metric to quantify the severity of object hallucination. They find that the models tend to make predictions consistent with a language model trained on the captions instead of a model trained to predict objects in an image; therefore hallucination is caused by an over-reliance on the language priors. Nie et al. (2019) believes that the origin of the hallucination problem in neural surface realization comes from the data side. More specifically, datasets used for NLG systems often include instances with information misalignment between the input structure and the output text. They propose integrating a language understanding module for iterative data refinement to better align meaning representations and output text. Müller et al. (2019) examines hallucination in neural machine translation and observes that the phenomenon is most common in out-of-domain settings. They empirically compare several strategies to improve domain robustness in NMT and find that a combination of reconstruction and a noisy channel model for reranking is most effective.

These observations are consistent with our findings. For example, domain shift and data misalignment are known to lead to a higher level of epistemic uncertainty (Kendall and Gal, 2017), which makes hallucination a more severe problem.

Uncertainty quantification: Uncertainty quantification has attracted more attention recently due to the progress in Bayesian deep learning. Bayes by backprop (Blundell et al., 2015), Monte Carlo dropout (Gal and Ghahramani, 2016), and deep ensembles (Lakshminarayanan et al., 2017) are examples of popular Bayesian approaches to evaluate uncertainty with deep neural models. Kendall and Gal (2017) investigates the benefits of modeling epistemic and aleatoric uncertainty in vision tasks such as semantic segmentation and depth regression. They show that it is important to model aleatoric uncertainty with large datasets and real-time applications, and epistemic uncertainty with small datasets and safety-critical applications. Other applications of uncertainty quantification have been explored in the context of time series predictions (Zhu and Laptev, 2017), natural language processing tasks (Xiao and Wang, 2019), etc. More broadly, prediction entropy has been analyzed in different neural language generation tasks (Ott et al., 2018; Xu et al., 2020). Depeweg et al. (2018) shows how to extract and decompose uncertainty in Bayesian neural networks with latent variables for decision-making purposes. They show that active learning and risk-sensitive reinforcement learning both benefit from uncertainty decomposition.

7 Discussion and Conclusions

We investigate the relationship between hallucination and predictive uncertainty in image captioning and data-to-text generation tasks and show that predictions with higher uncertainty are more prone to hallucination. In particular, epistemic uncertainty is more indicative of hallucination than aleatoric uncertainty. We propose uncertainty-aware beam search to incorporate uncertainty into the decoding process to reduce hallucination. We show that uncertainty decomposition helps the proposed beam search variant to achieve a better performance-hallucination trade-off. Specifically, penalizing epistemic uncertainty yields better results compared to penalizing aleatoric or total uncertainty.

In this work, we analyze uncertainty at the token level. This might be restrictive because uncertainty corresponds to the current prediction context instead of the predicted token. The relationship between hallucination and uncertainty, therefore, can be much more complicated than a linear one. It is still possible to produce hallucinated information with a very confident model. The proposed UABS reduces hallucination by limiting the total uncertainty of the generated text. As a result, it might lead to shorter generations and lower generation quality. Devising more sophisticated uncertainty-aware training and decoding methods with less adverse effects on the generation quality is a future direction to explore.

Acknowledgement

We would like to gratefully acknowledge the support of the National Science Foundation. The views expressed are those of the author and do not reflect the official policy or position of the US government.
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077-6086.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. 2018. Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1184-1193. PMLR.
Armen Der Kiureghian and Ove Ditlevsen. 2009. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105-112.
Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055-5070, Online. Association for Computational Linguistics.
Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059.
Geoffrey E Hinton and Drew Van Camp. 1993. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137.
Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of the 21st Annual Meeting on Association for Computational Linguistics, pages 145-150. Association for Computational Linguistics.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413.
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6964-6974.
Kathleen McKeown. 1992. Text Generation. Cambridge University Press.
Mathias Müller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation. arXiv preprint arXiv:1911.03109.
Radford M Neal. 1995. Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto.
Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. 2019. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673-2679.
Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3956-3965.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.
Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008-7024.
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035-4045.
Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv preprint arXiv:1910.08684.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566-4575.
Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253-2263.
Yijun Xiao and William Yang Wang. 2019. Quantifying uncertainties in natural language processing tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7322-7329.
Jiacheng Xu, Shrey Desai, and Greg Durrett. 2020. Understanding neural abstractive summarization models via uncertainty. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6275-6281.
Lingxue Zhu and Nikolay Laptev. 2017. Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 103-110. IEEE.