Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz, Dan Jurafsky
Stanford University
{ysmiura, zyh, ebtsai, langlotz, jurafsky}@stanford.edu

Abstract

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (∆ + 63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.

[Figure 1. Medical images are encoded and a report is decoded from the encoded representation.
Reference Report: "Large right pleural effusion is unchanged in size. There is associated right basilar atelectasis/scarring, also stable. Healed right rib fractures are noted. On the left, there is persistent apical pleural thickening and apical scarring. Linear opacities projecting over the lower lobe are also compatible with scarring, unchanged. There is no left pleural effusion. There is no pneumothorax. …"
Generated Report: "… The heart size remains unchanged and is within normal limits. Unchanged appearance of thoracic aorta. The pulmonary vasculature is not congested. Bilateral pleural effusions are again noted and have increased in size on the right than the left. The left-sided pleural effusion has increased in size and is now moderate in size."
Figure 1: A (partial) example of a report generated from our system (with ". . ." representing abbreviated text). The system encodes images and generates text from that encoded representation. Underlined words are disease and anatomy entities. The shaded sentences are an example of a contradictory pair.]

1 Introduction

An important new application of natural language generation (NLG) is to build assistive systems that take X-ray images of a patient and generate a textual report describing clinical observations in the images (Jing et al., 2018; Li et al., 2018; Liu et al., 2019; Boag et al., 2020; Chen et al., 2020).
Figure 1 shows an example of a radiology report generated by such a system. This is a clinically important task, offering the potential to reduce radiologists' repetitive work and generally improve clinical communication (Kahn et al., 2009).

Automatic radiology report generation systems have achieved promising performance as measured by widely used NLG metrics such as CIDEr (Vedantam et al., 2015) and BLEU (Papineni et al., 2002) on several datasets (Li et al., 2018; Jing et al., 2019; Chen et al., 2020). However, reports that achieve high performance on these NLG metrics are not always factually complete or consistent. In addition to the use of inadequate metrics, the factual incompleteness and inconsistency issue in generated reports is further exacerbated by the inadequate training of these systems. Specifically, the standard teacher-forcing training algorithm (Williams and Zipser, 1989) used by most existing work can lead to a discrepancy between what the model sees during training and test time (Ranzato et al., 2016),
resulting in degenerate outputs with factual hallucinations (Maynez et al., 2020). Liu et al. (2019) and Boag et al. (2020) have shown that reports generated by state-of-the-art systems still have poor quality when evaluated by their clinical metrics as measured with an information extraction system designed for radiology reports. For example, the generated report in Figure 1 is incomplete since it neglects an observation of atelectasis that can be found in the images. It is also inconsistent since it mentions left-sided pleural effusion which is not present in the images. Indeed, we show that existing systems are inadequate in factual completeness and consistency, and that an image-to-text radiology report generation system can be substantially improved by replacing widely used NLG metrics with simple alternatives.

We propose two new simple rewards that can encourage the factual completeness and consistency of the generated reports. First, we propose the Exact Entity Match Reward (factENT) which captures the completeness of a generated report by measuring its coverage of entities in the radiology domain, compared with a reference report. The goal of the reward is to better capture disease and anatomical knowledge that are encoded in the entities. Second, we propose the Entailing Entity Match Reward (factENTNLI), which extends factENT with a natural language inference (NLI) model that further considers how inferentially consistent the generated entities are with their descriptions in the reference. We add NLI to control the overestimation of disease when optimizing towards factENT. We use these two metrics along with an existing semantic equivalence metric, BERTScore (Zhang et al., 2020a), to potentially capture synonyms (e.g., "left and right" effusions are synonymous with "bilateral" effusions) and distant dependencies between diseases (e.g., a negation like ". . . but underlying consolidation or other pulmonary lesion not excluded") that are present in radiology reports.

Although recent work in summarization, dialogue, and data-to-text generation has tried to address this problem of factual incompleteness and inconsistency by using natural language inference (NLI) (Falke et al., 2019; Welleck et al., 2019), question answering (QA) (Wang et al., 2020a), or content matching constraint (Wang et al., 2020b) approaches, they either show negative results or are not directly applicable to the generation of radiology reports due to a substantial task and domain difference. To construct the NLI model for factENTNLI, we present a weakly supervised approach that adapts an existing NLI model to the radiology domain. We further present a report generation model which directly optimizes a Transformer-based architecture with these rewards using reinforcement learning (RL).

We evaluate our proposed report generation model on two publicly available radiology report generation datasets. We find that optimizing the proposed rewards along with BERTScore by RL leads to generated reports that achieve substantially improved performance in the important clinical metrics (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020), demonstrating the higher clinical value of our approach. We make all our code and the expert-labeled test set for evaluating the radiology NLI model publicly available to encourage future research [1]. To summarize, our contributions in this paper are:

1. We propose two simple rewards for image-to-text radiology report generation, which focus on capturing the factual completeness and consistency of generated reports, and a weak supervision-based approach for training a radiology-domain NLI model to realize the second reward.

2. We present a new radiology report generation model that directly optimizes these new rewards with RL, showing that previous approaches that optimize traditional NLG metrics are inadequate, and that the proposed approach substantially improves performance on clinical metrics (as much as ∆ + 64.2%) on two publicly available datasets.

[1] https://github.com/ysmiura/ifcc

2 Related Work

2.1 Image-to-Text Radiology Report Generation

Wang et al. (2018) and Jing et al. (2018) first proposed multi-task learning models that jointly generate a report and classify disease labels from a chest X-ray image. Their models were extended to use multiple images (Yuan et al., 2019), to adopt a hybrid retrieval-generation model (Li et al., 2018), or to consider structure information (Jing et al., 2019). More recent work has focused on generating reports that are clinically consistent and accurate. Liu et al. (2019) presented a system that generates accurate reports by fine-tuning it with their Clinically
Coherent Reward. Boag et al. (2020) evaluated several baseline generation systems with clinical metrics and found that standard NLG metrics are ill-equipped for this task. Very recently, Chen et al. (2020) proposed an approach to generate radiology reports with a memory-driven Transformer. Our work is most related to Liu et al. (2019); their system, however, is dependent on a rule-based information extraction system specifically created for chest X-ray reports and has limited robustness and generalizability to different domains within radiology. By contrast, we aim to develop methods that improve the factual completeness and consistency of generated reports by harnessing more robust statistical models and are easily generalizable.

2.2 Consistency and Faithfulness in Natural Language Generation

A variety of recent work has focused on consistency and faithfulness in generation. Our work is inspired by Falke et al. (2019), Welleck et al. (2019), and Matsumaru et al. (2020) in using NLI to rerank or filter generations in text summarization, dialogue, and headline generation systems, respectively. Other attempts in this direction include evaluating consistency in generations using QA models (Durmus et al., 2020; Wang et al., 2020a; Maynez et al., 2020), with distantly supervised classifiers (Kryściński et al., 2020), and with task-specific content matching constraints (Wang et al., 2020b). Liu et al. (2019) and Zhang et al. (2020b) studied improving the factual correctness in generating radiology reports with rule-based information extraction systems. Our work mainly differs from theirs in the direct optimization of factual completeness with an entity-based reward and of factual consistency with a statistical NLI-based reward.

2.3 Image Captioning with Transformer

The problem of generating text from image data has been widely studied in the image captioning setting. While early work focused on combining convolutional neural network (CNN) and recurrent neural network (RNN) architectures (Vinyals et al., 2015), more recent work has discovered the effectiveness of using the Transformer architecture (Vaswani et al., 2017). Li et al. (2019) and Pan et al. (2020) introduced an attention process to exploit semantic and visual information into this architecture. Herdade et al. (2019), Cornia et al. (2020), and Guo et al. (2020) extended this architecture to learn geometrical and other relationships between input regions. We find Meshed-Memory Transformer (Cornia et al., 2020) (M2 Trans) to be more effective in our radiology report generation task than the traditional RNN-based models and Transformer models (an empirical result will be shown in §4), and therefore use it as our base architecture.

3 Methods

3.1 Image-to-Text Radiology Report Generation with M2 Trans

Formally, given K individual images x_{1...K} of a patient, our task involves generating a sequence of words to form a textual report ŷ, which describes the clinical observations in the images. This task resembles image captioning, except with multiple images as input and longer text sequences as output. We therefore extend a state-of-the-art image captioning model, M2 Trans (Cornia et al., 2020), with multi-image input as our base architecture. We first briefly introduce this model and refer interested readers to Cornia et al. (2020).

[Figure 2: An overview of Meshed-Memory Transformer extended to multiple images.]

Figure 2 illustrates an overview of the M2 Trans model. Given an image x_k, image regions are first extracted with a CNN as X = CNN(x_k). X is then encoded with a memory-augmented attention process M_mem(X) as

  $M_{mem}(X) = \text{Att}(W_q X, K, V)$  (1)
  $\text{Att}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$  (2)
  $K = [W_k X; M_k]$  (3)
  $V = [W_v X; M_v]$  (4)

where W_q, W_k, W_v are weights, M_k, M_v are memory matrices, d is a scaling factor, and [∗; ∗] is the concatenation operation. Att(Q, K, V) is an attention process derived from the Transformer architecture (Vaswani et al., 2017) and extended to include memory matrices that can encode a priori knowledge between image regions. In the encoder, this attention process is a self-attention process since all of the query Q, the key K, and the value V depend on X. M_mem(X) is further processed with a feed forward layer, a residual connection, and a layer normalization to output X̃. This encoding process can be stacked N times and is applied to the K images, and the n-th layer output for the K images will be X̃^{n,K}.
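The memory-augmented attention of Eq. 1–4 is compact enough to sketch in code. The following is a minimal single-head PyTorch sketch; the class and variable names are illustrative, and the actual M2 Trans implementation is multi-head, stacked N times, and followed by the feed forward, residual, and layer normalization steps described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedAttention(nn.Module):
    """Single-head sketch of Eq. 1-4: keys and values are extended with
    learned memory slots that can encode a priori knowledge."""

    def __init__(self, d: int, n_mem: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)          # W_q
        self.w_k = nn.Linear(d, d, bias=False)          # W_k
        self.w_v = nn.Linear(d, d, bias=False)          # W_v
        self.m_k = nn.Parameter(torch.randn(n_mem, d))  # memory matrix M_k
        self.m_v = nn.Parameter(torch.randn(n_mem, d))  # memory matrix M_v
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_regions, d) CNN region features of one image
        q = self.w_q(x)                                  # W_q X
        k = torch.cat([self.w_k(x), self.m_k], dim=0)    # [W_k X; M_k]
        v = torch.cat([self.w_v(x), self.m_v], dim=0)    # [W_v X; M_v]
        att = F.softmax(q @ k.t() / self.d ** 0.5, dim=-1)  # softmax(QK^T / sqrt(d))
        return att @ v                                   # Att(Q, K, V)
```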
The meshed decoder first processes an encoded text Y with a masked self-attention and further processes it with a feed forward layer, a residual connection, and a layer normalization to output Ÿ. Ÿ is then passed to a cross attention C(X̃^{n,K}, Ÿ) and a meshed attention M_mesh(X̃^{N,K}, Ÿ) as

  $M_{mesh}(\tilde{X}^{N,K}, \ddot{Y}) = \sum_n \alpha_n \odot C(\tilde{X}^{n,K}, \ddot{Y})$  (5)
  $C(\tilde{X}^{n,K}, \ddot{Y}) = \max_K(\text{Att}(W_q \ddot{Y}, W_k \tilde{X}^{n,K}, W_v \tilde{X}^{n,K}))$  (6)
  $\alpha_n = \sigma(W_n [\ddot{Y}; C(\tilde{X}^{n,K}, \ddot{Y})] + b_n)$  (7)

where ⊙ is element-wise multiplication, max_K is max-pooling over the K images, σ is the sigmoid function, W_n is a weight, and b_n is a bias. The weighted summation in M_mesh(X̃^{N,K}, Ÿ) exploits both low-level and high-level information from the N stacked encoder layers. Differing from the self-attention process in the encoder, the cross attention uses a query that depends on Y and a key and a value that depend on X. M_mesh(X̃^{N,K}, Ÿ) is further processed with a feed forward layer, a residual connection, and a layer normalization to output Ỹ. As in the encoder, the decoder can be stacked N times to output Ỹ^N. Ỹ^N is further passed to a feed forward layer to output the report ŷ.

3.2 Optimization with Factual Completeness and Consistency

3.2.1 Exact Entity Match Reward (factENT)

We designed an F-score entity match reward to capture factual completeness. This reward assumes that entities encode disease and anatomical knowledge that relates to factual completeness. A named entity recognizer is applied to ŷ and the corresponding reference report y. Given entities E_gen and E_ref recognized from y_gen and y_ref respectively, the precision (pr) and recall (rc) of entity match are calculated as

  $pr_{ENT} = \frac{\sum_{e \in E_{gen}} \delta(e, E_{ref})}{|E_{gen}|}$  (8)
  $rc_{ENT} = \frac{\sum_{e \in E_{ref}} \delta(e, E_{gen})}{|E_{ref}|}$  (9)
  $\delta(e, E) = \begin{cases} 1, & \text{for } e \in E \\ 0, & \text{otherwise} \end{cases}$  (10)

The harmonic mean of precision and recall is taken as factENT to reward a balanced match of entities. We used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer for radiology reports. For example, in the case of Figure 1, the common entities among the reference report and the generated report are pleural and effusion, resulting in factENT = 33.3.
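As a minimal sketch (not the released implementation), factENT can be computed as follows. The Stanza pipeline arguments follow Stanza's biomedical model documentation; the exact entity normalization used in the experiments may differ:

```python
import stanza

# Clinical NER pipeline (Zhang et al., 2020c) with the radiology NER model.
nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "radiology"})

def entities(text: str) -> set:
    """Lowercased entity surface forms recognized in a report."""
    return {ent.text.lower() for ent in nlp(text).entities}

def fact_ent(gen: str, ref: str) -> float:
    """Harmonic mean of entity-match precision (Eq. 8) and recall (Eq. 9)."""
    e_gen, e_ref = entities(gen), entities(ref)
    if not e_gen or not e_ref:
        return 0.0
    pr = len(e_gen & e_ref) / len(e_gen)
    rc = len(e_gen & e_ref) / len(e_ref)
    return 2 * pr * rc / (pr + rc) if pr + rc > 0 else 0.0
```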
3.2.2 Entailing Entity Match Reward (factENTNLI)

We additionally designed an F-score style reward that expands factENT with NLI to capture factual consistency. NLI is used to control the overestimation of disease when optimizing towards factENT. In factENTNLI, δ in Eq. 10 is expanded to

  $\phi(e, E) = \begin{cases} 1, & \text{for } e \in E \wedge \text{NLI}_e(P, h) \neq \text{contradiction} \\ 1, & \text{for } \text{NLI}_e(P, h) = \text{entailment} \\ 0, & \text{otherwise} \end{cases}$  (11)
  $\text{NLI}_e(P, h) = \text{nli}(\hat{p}, h) \text{ where } \hat{p} = \underset{p \in P}{\arg\max}\ \text{sim}(h, p)$  (12)

where h is a sentence that includes e, P is all sentences in a counterpart text (if h is a sentence in a generated report, P is all sentences in the corresponding reference report), nli(∗, ∗) is an NLI function that returns an NLI label which is one of {entailment, neutral, contradiction}, and sim(∗, ∗) is a text similarity function. We used BERTScore (Zhang et al., 2020a) as sim(∗, ∗) in the experiments (the detail of BERTScore can be found in Appendix A). The harmonic mean of precision and recall is taken as factENTNLI to encourage a balanced factual consistency between a generated text and the corresponding reference text. For example, in the case of Figure 1, the sentence "The left-sided pleural effusion has increased in size and is now moderate in size." will be contradictory to "There is no left pleural effusion.", resulting in pleural and effusion being rejected in y_gen.
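A minimal sketch of the entailment-aware filter φ of Eq. 11–12 is given below; `sim` and `nli_label` stand in for BERTScore similarity and the weakly-supervised radiology NLI model of §3.3 and are assumed to be available:

```python
def phi(e: str, h: str, counterpart_sents: list, e_counterpart: set) -> int:
    """Eq. 11-12: keep entity e (appearing in sentence h) if the most
    similar counterpart sentence entails h, or if e also appears among the
    counterpart entities and that sentence does not contradict h.
    `sim` and `nli_label` are assumed stand-ins for BERTScore and the
    radiology NLI model."""
    p_hat = max(counterpart_sents, key=lambda p: sim(h, p))  # Eq. 12
    label = nli_label(p_hat, h)  # "entailment" | "neutral" | "contradiction"
    if label == "entailment":
        return 1
    if e in e_counterpart and label != "contradiction":
        return 1
    return 0
```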
3.2.3 Joint Loss for Optimizing Factual Completeness and Consistency

We integrate the proposed factual rewards into self-critical sequence training (Rennie et al., 2017). An RL loss L_RL is minimized as the negative expectation of the reward r. The gradient of the loss is estimated with a single Monte Carlo sample as

  $\nabla_\theta L_{RL}(\theta) = -\nabla_\theta \log P_\theta(\hat{y}^{sp} \mid x_{1...K}) \left(r(\hat{y}^{sp}) - r(\hat{y}^{gd})\right)$  (13)

where ŷ^sp is a sampled text and ŷ^gd is a greedy decoded text. Paulus et al. (2018) and Zhang et al. (2020b) have shown that a generation can be improved by combining multiple losses. We combine a factual metric loss with a language model loss and an NLG loss as

  $L = \lambda_1 L_{NLL} + \lambda_2 L_{RL\_NLG} + \lambda_3 L_{RL\_FACT}$  (14)

where L_NLL is a language model loss, L_RL_NLG is the RL loss using an NLG metric (e.g., CIDEr or BERTScore), L_RL_FACT is the RL loss using a factual reward (e.g., factENT or factENTNLI), and λ∗ are scaling factors to balance the multiple losses.
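A minimal sketch of the self-critical update of Eq. 13 and the joint loss of Eq. 14; sampling, greedy decoding, and reward computation are assumed to happen outside these functions, and the λ values shown are those used for the NLL+BS+fcE configuration in §4.3:

```python
import torch

def scst_loss(log_probs_sampled: torch.Tensor,
              reward_sampled: float,
              reward_greedy: float) -> torch.Tensor:
    """Eq. 13: REINFORCE with the greedy-decoded reward as baseline.
    `log_probs_sampled` is the sum of token log-probabilities of the
    sampled report under the current model."""
    advantage = reward_sampled - reward_greedy  # r(y_sp) - r(y_gd)
    return -log_probs_sampled * advantage

def joint_loss(l_nll: torch.Tensor,
               l_rl_nlg: torch.Tensor,
               l_rl_fact: torch.Tensor,
               lam=(0.01, 0.495, 0.495)) -> torch.Tensor:
    """Eq. 14: weighted combination of the language model loss and the
    two RL losses."""
    return lam[0] * l_nll + lam[1] * l_rl_nlg + lam[2] * l_rl_fact
```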
3.3 A Weakly-Supervised Approach for Radiology NLI

We propose a weakly-supervised approach to construct an NLI model for radiology reports. (There already exists an NLI system for the medical domain, MedNLI (Romanov and Shivade, 2018), but we found that a model trained on MedNLI does not work well on radiology reports.) Given a large scale dataset of radiology reports, a sentence pair is sampled and filtered with weakly-supervised rules. The rules are prepared to extract a randomly sampled sentence pair (s1 and s2) that are in an entailment, neutral, or contradiction relation. We designed the following 6 rules for weak supervision (a code sketch of the E1 rule is given at the end of this section).

Entailment 1 (E1) (1) s1 and s2 are semantically similar and (2) NE of s2 is a subset of or equal to NE of s1.

Neutral 1 (N1) (1) s1 and s2 are semantically similar and (2) NE of s1 is a subset of NE of s2.

Neutral 2 (N2) (1) NE of s1 are equal to NE of s2 and (2) s1 includes an antonym of a word in s2.

Neutral 3 (N3) (1) NE types of s1 are equal to NE types of s2 and (2) NE of s1 is different from NE of s2. NE types are used in this rule to introduce a certain level of similarity between s1 and s2.

Neutral 4 (N4) (1) NE of s1 are equal to NE of s2 and (2) s1 and s2 include observation keywords.

Contradiction 1 (C1) (1) NE of s1 is equal to or a subset of NE of s2 and (2) s1 is a negation of s2.

The rules rely on a semantic similarity measure and the overlap of entities to determine the relationship between s1 and s2. In the neutral rules and the contradiction rule, we included similarity measures to avoid extracting easy-to-distinguish sentence pairs.

We evaluated this NLI by preparing training data, validation data, and test data. For the training data, the training set of MIMIC-CXR (Johnson et al., 2019) is used as the source of sentence pairs. 2k pairs are extracted for E1 and C1, and 0.5k pairs are extracted for each of N1, N2, N3, and N4, resulting in a total of 6k pairs. The training set of MedNLI is also used as additional data. For the validation data and the test data, we sampled 480 sentence pairs from the validation section of MIMIC-CXR and had them annotated by two experts: one medical expert and one NLP expert. Each pair is annotated twice, swapping its premise and hypothesis, resulting in 960 pairs, which are split in half resulting in 480 pairs for a validation set and 480 pairs for a test set. The test set of MedNLI is also used as alternative test data.

We used BERT (Devlin et al., 2019) as an NLI model since it performed as a strong baseline in the existing MedNLI system (Ben Abacha et al., 2019), and used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer. Table 1 shows the result of the model trained with and without the weakly-supervised data. The accuracy of NLI on radiology data increased substantially by +24.5% with the addition of the radiology NLI training set. (See Appendix A for the detail of the rules, the datasets, and the model configuration.)

Table 1: The accuracies of the NLI model trained with the weakly-supervised approach. RadNLI is the proposed NLI for radiology reports. The values are the average of 5 runs and the bold values are the best results of each test set.

  Training Data     #samples   Test Accuracy (RadNLI)   Test Accuracy (MedNLI)
  MedNLI            13k        53.3                     80.9
  MedNLI + RadNLI   19k        77.8                     79.8
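As an illustration, rule E1 could be implemented as the following filter; `sim`, `entities`, and `is_negated` are assumed helpers built on BERTScore, the Stanza clinical NER, and NegBio (see Appendix A for the exact conditions and thresholds):

```python
def matches_e1(s1: str, s2: str) -> bool:
    """Entailment rule E1: a semantically similar pair whose hypothesis
    entities are a subset of (or equal to) the premise entities.
    `sim`, `entities`, and `is_negated` are assumed helpers."""
    if sim(s1, s2) < 0.7:                 # similarity threshold (Appendix A)
        return False
    ne1, ne2 = entities(s1), entities(s2)
    if len(ne2) < 2:                      # |NE(s2)| >= 2 restriction
        return False
    if is_negated(s1) != is_negated(s2):  # both negated or both non-negated
        return False
    return ne2 <= ne1                     # NE(s2) subset of NE(s1)
```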
4 Experiments

4.1 Data

We used the training and validation sets of MIMIC-CXR (Johnson et al., 2019) to train and validate models. MIMIC-CXR is a large publicly available database of chest radiographs. We extracted the findings sections from the reports with a text extraction tool for MIMIC-CXR [2], and used them as our reference reports as in previous work (Liu et al., 2019; Boag et al., 2020). The findings section is a natural language description of the important aspects in a radiology image. The reports with empty findings sections were discarded, resulting in 152173 and 1196 reports for the training and validation set, respectively. We used the test set of MIMIC-CXR and the entire Open-i Chest X-ray dataset (Demner-Fushman et al., 2012) as two individual test sets. Open-i is another publicly available database of chest radiographs which has been widely used in past studies. We again extracted the findings sections, resulting in 2347 reports for MIMIC-CXR and 3335 reports for Open-i. Open-i is used only for testing since the number of reports is too small to train and test a neural report generation model.

[2] https://github.com/MIT-LCP/mimic-cxr/tree/master/txt

4.2 Evaluation Metrics

BLEU4, CIDEr-D & BERTScore: We first use general NLG metrics to evaluate the generation quality. These metrics include the 4-gram BLEU score (Papineni et al., 2002, BLEU4), CIDEr score (Vedantam et al., 2015) with gaming penalties (CIDEr-D), and the F1 score of BERTScore (Zhang et al., 2020a).

Clinical Metrics: However, NLG metrics such as BLEU and CIDEr are known to be inadequate for evaluating factual completeness and consistency. We therefore followed previous work (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020) by additionally evaluating the clinical accuracy of the generated reports using a clinical information extraction system. We use CheXbert (Smit et al., 2020), an information extraction system for chest reports, to extract the presence status of a series of observations (i.e., whether a disease is present or not), and score a generation by comparing the values of these observations to those obtained from the reference [3]. The micro average of accuracy, precision, recall, and F1 scores are calculated over 5 observations (following previous work (Irvin et al., 2019)): atelectasis, cardiomegaly, consolidation, edema, and pleural effusion [4].

[3] We used CheXbert instead of CheXpert (Irvin et al., 2019) since CheXbert was evaluated to be approximately 5.5% more accurate than CheXpert. The evaluation using CheXpert can be found in Appendix C.

[4] These 5 observations are evaluated to be most represented in real-world radiology reports and therefore using these 5 observations (and excluding others) leads to less variance and more statistical strength in the results. We include the detailed results of the clinical metrics in Appendix C for completeness.
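As a sketch, the micro-averaged clinical scores can be computed from the extracted presence labels as follows; running CheXbert to obtain the labels is assumed to happen upstream:

```python
def micro_prf1(gen_labels: list, ref_labels: list):
    """Micro-averaged precision/recall/F1 over binary presence labels.
    Each element is a dict mapping the 5 observations (atelectasis,
    cardiomegaly, consolidation, edema, pleural effusion) to 0/1,
    as extracted by a labeler such as CheXbert."""
    tp = fp = fn = 0
    for gen, ref in zip(gen_labels, ref_labels):
        for obs in ref:
            if gen[obs] and ref[obs]:
                tp += 1
            elif gen[obs] and not ref[obs]:
                fp += 1
            elif not gen[obs] and ref[obs]:
                fn += 1
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1
```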
factENT & factENTNLI: We additionally include our proposed rewards factENT and factENTNLI as metrics to compare their values for different models.

4.3 Model Variations

We used M2 Trans as our report generation model and used DenseNet-121 (Huang et al., 2017) as our image encoder. We trained M2 Trans with the following variety of joint losses.

NLL: M2 Trans simply optimized with NLL loss as a baseline loss.

NLL+CDr: CIDEr-D and NLL loss are jointly optimized with λ1 = 0.01 and λ2 = 0.99 for the scaling factors.

NLL+BS: The F1 score of BERTScore and NLL loss are jointly optimized with λ1 = 0.01 and λ2 = 0.99.

NLL+BS+fcE: factENT is added to NLL+BS with λ1 = 0.01, λ2 = 0.495, and λ3 = 0.495.

NLL+BS+fcEN: factENTNLI is added to NLL+BS with λ1 = 0.01, λ2 = 0.495, and λ3 = 0.495.

We additionally prepared three previous models that have been tested on MIMIC-CXR.

TieNet: We reimplemented the model of Wang et al. (2018) consisting of a CNN encoder and an RNN decoder optimized with a multi-task setting of language generation and image classification.

CNN-RNN2: We reimplemented the model of Liu et al. (2019) consisting of a CNN encoder and a hierarchical RNN decoder optimized with CIDEr and Clinically Coherent Reward, which is a reward based on the clinical metrics.

R2Gen: The model of Chen et al. (2020) with a CNN encoder and a memory-driven Transformer optimized with NLL loss. We used the publicly available official code and its checkpoint as its implementation.

For reproducibility, we include model configurations and training details in Appendix B.
Table 2: Results of the baselines and our M2 Trans model trained with different joint losses. For the metrics, BL4, CDr, and BS represent BLEU4, CIDEr-D, and the F1 score of BERTScore; P, R, F1, and acc. represent the precision, recall, F1, and accuracy scores output by the clinical CheXbert labeler, respectively. For the rewards, fcE and fcEN represent factENT and factENTNLI, respectively.

MIMIC-CXR
                                       NLG Metrics         Clinical Metrics (micro-avg)   Factual Rewards
  Model                                BL4   CDr    BS     P     R     F1    acc.         fcE    fcEN
  Previous models
  TieNet (Wang et al., 2018)           8.1   37.2   49.2   38.6  20.9  27.1  74.0         −      −
  CNN-RNN2 (Liu et al., 2019)          7.6   44.7   41.2   66.4  18.7  29.2  79.0         −      −
  R2Gen (Chen et al., 2020)            8.6   40.6   50.8   41.2  29.8  34.6  73.9         −      −
  Proposed approach without proposed optimization
  M2 Trans w/ NLL                      10.5  44.5   51.2   48.9  41.1  44.7  76.5         27.3   24.4
  M2 Trans w/ NLL+CDr                  13.3  67.0   55.9   50.0  51.3  50.6  76.9         35.2   32.9
  Proposed approach
  M2 Trans w/ NLL+BS                   12.2  58.4   58.4   46.3  67.5  54.9  74.4         35.9   33.0
  M2 Trans w/ NLL+BS+fcE               11.1  49.2   57.2   46.3  73.2  56.7  74.2         39.5   34.8
  M2 Trans w/ NLL+BS+fcEN              11.4  50.9   56.9   50.3  65.1  56.7  77.1         38.5   37.9

Open-i
  Model                                BL4   CDr    BS     P     R     F1    acc.         fcE    fcEN
  Previous models
  TieNet (Wang et al., 2018)           9.0   65.7   56.1   46.9  15.9  23.7  96.0         −      −
  CNN-RNN2 (Liu et al., 2019)          12.1  87.2   57.1   55.1  7.5   13.2  96.1         −      −
  R2Gen (Chen et al., 2020)            6.7   61.4   53.8   27.0  17.3  21.1  94.9         −      −
  Proposed approach without proposed optimization
  M2 Trans w/ NLL                      8.2   64.4   53.1   44.7  32.7  37.8  95.8         31.1   34.1
  M2 Trans w/ NLL+CDr                  13.4  97.2   59.9   48.2  24.2  32.2  96.0         40.6   42.9
  Proposed approach
  M2 Trans w/ NLL+BS                   12.3  87.3   62.4   47.7  46.6  47.2  95.9         41.5   44.1
  M2 Trans w/ NLL+BS+fcE               12.0  99.6   62.6   44.0  53.5  48.3  95.5         44.4   46.8
  M2 Trans w/ NLL+BS+fcEN              13.1  103.4  61.0   48.7  46.9  47.8  96.0         43.6   47.1
5 Results and Discussions

5.1 Evaluation with NLG Metrics and Clinical Metrics

Table 2 shows the results of the baselines [5] and M2 Trans optimized with the five different joint losses. We find that the best result for a metric or a reward is achieved when that metric or reward is used directly in the optimization objective. Notably, for the proposed factual rewards, the increases of +3.6 factENT and +4.9 factENTNLI are observed on MIMIC-CXR with M2 Trans when compared against M2 Trans w/ BS. For the clinical metrics, the best recalls and F1 scores are obtained with M2 Trans using factENT as a reward, achieving a substantial +22.1 increase (∆+63.9%) in F1 score against the best baseline R2Gen. We further find that using factENTNLI as a reward leads to higher precision and accuracy compared to factENT with decreases in the recalls. The best precisions and accuracies were obtained in the baseline CNN-RNN2. This is not surprising since this model directly optimizes the clinical metrics with its Clinically Coherent Reward. However, this model is strongly optimized against precision, resulting in the low recalls and F1 scores.

The results of M2 Trans without the proposed rewards and BERTScore reveal the strength of M2 Trans and the inadequacy of NLL loss and CIDEr for factual completeness and consistency. M2 Trans w/ NLL shows strong improvements in the clinical metrics against R2Gen. These improvements are a little surprising since both models are Transformer-based models and are optimized with NLL loss. We assume that these improvements are due to architecture differences such as memory matrices in the encoder of M2 Trans. The difference between NLL and NLL+CDr on M2 Trans indicates that NLL and CIDEr are unreliable for factual completeness and consistency.

[5] These MIMIC-CXR scores have some gaps from the previously reported values with some possible reasons. First, TieNet and CNN-RNN2 in Liu et al. (2019) are evaluated on a pre-release version of MIMIC-CXR. Second, we used report-level evaluation for all models, but Chen et al. (2020) tested R2Gen using image-level evaluation.
5.2 Human Evaluation

We performed a human evaluation to further confirm whether the generated radiology reports are factually complete and consistent. Following prior studies of radiology report summarization (Zhang et al., 2020b) and image captioning evaluation (Vedantam et al., 2015), we designed a simple human evaluation task. Given a reference report (R) and two candidate model generated reports (C1, C2), two board-certified radiologists decided whether C1 or C2 is more factually similar to R. To consider cases when C1 and C2 are difficult to differentiate, we also prepared "No difference" as an answer. We sampled 100 reports randomly from the test set of MIMIC-CXR for this evaluation. Since this evaluation is (financially) expensive and there has been no human evaluation between the baseline models, we selected R2Gen as the best previous model and M2 Trans w/ BS as the most simple proposed model, in order to be able to weakly infer that all of our proposed models are better than all of the baselines. Table 3 shows the result of the evaluation. The majority of the reports were labeled "No difference", but the proposed approach received three times as much preference as the baseline.

Table 3: The human evaluation result for randomly sampled 100 reports from the test set of MIMIC-CXR by two board-certified radiologists.

  M2 Trans w/ BS, Proposed (simple)   R2Gen (Chen et al., 2020)   No difference
  36.5%                               12.0%                       51.5%

There are two main reasons why "No difference" was frequent in human evaluation. First, we found that a substantial portion of the examples were normal studies (no abnormal observations), which leads to generated reports of similar quality from both models. Second, in some reports with multiple abnormal observations, both models made mistakes on a subset of these observations, making it difficult to decide which model output was better.

5.3 Estimating Clinical Accuracy with Factual Rewards

The integrations of factENT and factENTNLI showed improvements in the clinical metrics. We further examined whether these rewards can be used to estimate the performance of the clinical metrics, to see whether the proposed rewards can be used in an evaluation where a strong clinical information extraction system like CheXbert is not available. Table 4 shows Spearman correlations calculated on the generated reports of NLL+BS. factENTNLI shows the strongest correlation with the clinical accuracy, which aligns with the optimization where the best accuracy is obtained with NLL+BS+factENTNLI. This correlation value is slightly lower than a Spearman correlation which Maynez et al. (2020) observed with NLI for the factual data (0.264). The result suggests the effectiveness of using the factual rewards to estimate the factual completeness and consistency of radiology reports, although the correlations are still limited, with some room for improvement.

Table 4: The Spearman correlations ρ of NLG metrics and factual metrics against clinical accuracy. The strongest correlation among all metrics is shown in bold.

  Metric       ρ
  BLEU4        0.092
  CIDEr-D      0.034
  BERTScore    0.155
  factENT      0.196
  factENTNLI   0.255
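The correlations in Table 4 are standard Spearman rank correlations; a minimal sketch with illustrative placeholder data (the real inputs would be per-report reward scores and per-report CheXbert accuracies from the evaluation pipeline):

```python
from scipy.stats import spearmanr

# Illustrative placeholders, not actual experiment values.
reward_scores = [0.31, 0.42, 0.18, 0.55, 0.27]  # e.g., per-report factENTNLI
clinical_acc  = [0.60, 0.80, 0.40, 1.00, 0.60]  # per-report CheXbert accuracy

rho, p_value = spearmanr(reward_scores, clinical_acc)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```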
5.4 Qualitative Analysis of Improved Clinical Completeness and Consistency

The evaluation with the clinical finding metrics showed improved generation performance by integrating BERTScore, factENT, and factENTNLI. As a qualitative analysis, we examined some of the generated reports to see the improvements. Example 1 in Figure 3 shows the improved factual completeness and consistency with BERTScore. The atelectasis is correctly generated and left pleural effusion is correctly suppressed with NLL+BS. Example 2 in Figure 4 shows the improved factual completeness with factENTNLI. The edema is correctly generated and atelectasis is correctly suppressed with NLL+BS+fcEN. These examples reveal the strength of integrating the three metrics to generate factually complete and consistent reports.

Despite observing large improvements with our model in the clinical finding metrics evaluation, the model is still not complete and some typical factual errors can be found in its generated reports. For example, Example 3 in Figure 4 includes a comparison of an observation against a previous study, as in ". . . appear more prominent since . . ." in the reference, but our model (or any previous model) cannot capture this kind of comparison since the model is not designed to take into account the past reports of a patient as input. Additionally, in this example, edema is mentioned with uncertainty as "cannot be excluded" in the reference, but the generated report with factENTNLI simply indicates it as "There is mild pulmonary edema".
[Figure 3, Example 1]

Reference: "Large right pleural effusion is unchanged in size. There is associated right basilar atelectasis/scarring, also stable. Healed right rib fractures are noted. On the left, there is persistent apical pleural thickening and apical scarring. Linear opacities projecting over the lower lobe are also compatible with scarring, unchanged. There is no left pleural effusion. There is no pneumothorax. Hilar and cardiomediastinal contours are difficult to assess, but appear unchanged. Vascular stent is seen in the left axillary/subclavian region."

R2Gen: "PA and lateral chest views were obtained with patient in upright position. Analysis is performed in direct comparison with the next preceding similar study of DATE. The heart size remains unchanged and is within normal limits. Unchanged appearance of thoracic aorta. The pulmonary vasculature is not congested. Bilateral pleural effusions are again noted and have increased in size on the right than the left. The left-sided pleural effusion has increased in size and is now moderate in size."

M2 Trans w/ NLL+BS: "As compared to prior chest radiograph from DATE, there has been interval improvement of the right pleural effusion. There is a persistent opacity at the right lung base. There is persistent atelectasis at the right lung base. There is no left pleural effusion. There is no pneumothorax. The cardiomediastinal and hilar contours are unchanged."

Figure 3: An example of radiology reports generated by R2Gen and by the proposed model with the optimization integrating BERTScore. Repeated sentences are removed from the example to improve readability.

[Figure 4, Example 2]

Reference: "The cardiomediastinal and hilar contours are stable. The aorta is tortuous. The patient is status post median sternotomy. The heart is mildly enlarged. The aorta is tortuous. The lung volumes are lower compared to the prior chest radiograph. Mild pulmonary edema is present. Small bilateral pleural effusions are present. There is no focal consolidation. No pneumothorax is seen. Median sternotomy wires and mediastinal clips are noted."

M2 Trans w/ NLL+BS: "Assessment is limited by patient rotation. Frontal and lateral radiographs of the chest were acquired. There is new mild interstitial pulmonary edema. A small right pleural effusion may be minimally increased. There is also likely a trace left pleural effusion. There is no focal consolidation. The heart size is not significantly changed. There is no pneumothorax. Midline sternotomy wires are noted."

M2 Trans w/ NLL+BS+fcEN: "The patient is status post median sternotomy and CABG. Heart size is moderately enlarged. The aorta is tortuous and diffusely calcified. There is mild pulmonary vascular congestion. Small bilateral pleural effusions are present. Patchy opacities in the lung bases likely reflect atelectasis. No pneumothorax is identified. There are no acute osseous abnormalities."

[Figure 4, Example 3]

Reference: "A right-sided hemodialysis catheter terminates at the right atrium. Again seen are reticular interstitial opacities distributed evenly across both lungs, stable over multiple prior radiographs, previously attributed to chronic hypersensitivity pneumonitis on the chest CT from DATE. The cardiac and mediastinal silhouettes are unchanged. The central pulmonary vessels appear more prominent since the DATE study. Superimposed mild edema cannot be excluded. There is no focal consolidation, pleural effusion, or pneumothorax."

M2 Trans w/ NLL+BS: "Right-sided dual lumen central venous catheter tip terminates in the lower SVC. Heart size remains mildly enlarged. The mediastinal and hilar contours are unchanged. There is no pulmonary edema. Minimal atelectasis is noted in the lung bases without focal consolidation. No pleural effusion or pneumothorax is seen. There are no acute osseous abnormalities."

M2 Trans w/ NLL+BS+fcEN: "The cardiomediastinal and hilar contours are normal. The lung volumes are low. The lung volumes are present. There is mild pulmonary edema. There is no focal consolidation. No pleural effusion or pneumothorax is seen. A right-sided central venous catheter is seen with tip in the right atrium."

Figure 4: Examples of radiology reports generated by the proposed model with the optimization integrating BERTScore and factENTNLI. Repeated sentences are removed from the examples to improve readability.
6 Conclusion

We proposed two new simple rewards and combined them with a semantic equivalence metric to improve image-to-text radiology report generation systems. The two new rewards make use of radiology domain entities extracted with a named entity recognizer and a weakly-supervised NLI to capture the factual completeness and consistency of the generated reports. We further presented a Transformer-based report generation system that directly optimizes these rewards with self-critical reinforcement learning. On two open datasets, we showed that our system generates reports that are more factually complete and consistent than the baselines and leads to reports with substantially higher scores in clinical metrics. The integration of entities and NLI to improve the factual completeness and consistency of generation is not restricted to the domain of radiology reports, and we predict that a similar approach might similarly improve other data-to-text tasks.

Acknowledgements

We would like to thank the anonymous reviewers and the members of the Stanford NLP Group for their very helpful comments that substantially improved this paper.
References

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370–379.

William Boag, Tzu-Ming Harry Hsu, Matthew Mcdermott, Gabriela Berner, Emily Alesentzer, and Peter Szolovits. 2020. Baselines for Chest X-Ray Report Generation. In Proceedings of the Machine Learning for Health NeurIPS Workshop, volume 116, pages 126–140.

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. 2020. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1439–1449.

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-Memory Transformer for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10575–10584.

Dina Demner-Fushman, Sameer Antani, Matthew Simpson, and George R. Thoma. 2012. Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering, 6(2):168–177.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.

Christiane Fellbaum, editor. 1998. WordNet: A Lexical Database for English. MIT Press.

Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu. 2020. Normalized and geometry-aware self-attention network for image captioning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10333.

Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems, volume 32, pages 11137–11147.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence, volume 33, pages 590–597.

Baoyu Jing, Zeya Wang, and Eric Xing. 2019. Show, describe and conclude: On exploiting the structure information of chest x-ray reports. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6570–6580.

Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586.

Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(317).

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(16035).

Charles E. Kahn, Curtis P. Langlotz, Elizabeth S. Burnside, John A. Carrino, David S. Channin, David M. Hovsepian, and Daniel L. Rubin. 2009. Toward best practices in radiology reporting. Radiology, 252(3):852–856.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9332–9346.

Christy Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing. 2018. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in Neural Information Processing Systems, volume 31, pages 1530–1540.

Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. 2019. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8927–8936.

Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, and Marzyeh Ghassemi. 2019. Clinically accurate chest x-ray report generation. In Proceedings of the 4th Machine Learning for Healthcare Conference, volume 106, pages 249–269.

Kazuki Matsumaru, Sho Takase, and Naoaki Okazaki. 2020. Improving truthfulness of headline generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1335–1346.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.

Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10968–10977.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. 2018. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. In AMIA 2018 Informatics Summit.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1195.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596.

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Ng, and Matthew Lungren. 2020. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1500–1519.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020.

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Summers. 2018. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058.

Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, and Changyou Chen. 2020b. Towards faithful neural table-to-text generation with content-matching constraints. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1072–1086.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Zhaofeng Wu, Yan Song, Sicong Huang, Yuanhe Tian, and Fei Xia. 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 415–426.

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. 2019. Automatic radiology report generation based on multi-view image fusion and medical concept enrichment. In Medical Image Computing and Computer Assisted Intervention, pages 721–729.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020a. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D. Manning, and Curtis Langlotz. 2020b. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120.

Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, and Curtis P. Langlotz. 2020c. Biomedical and clinical English model packages in the Stanza Python NLP library. arXiv preprint arXiv:2007.14640.
A Detail of Radiology NLI

A.1 Rules & Examples of Weakly-Supervised Radiology NLI

We prepared the 6 rules (E1, N1–N4, and C1) to train the weakly-supervised radiology NLI. The rules are applied against sentence pairs consisting of premises (s1) and hypotheses (s2) to extract pairs that are in an entailment, neutral, or contradiction relation.

Entailment Rule: E1
1. s1 and s2 are semantically similar.
2. The named entities (NE) of s2 are a subset of or equal to the named entities of s1 as NE(s2) ⊆ NE(s1).

We used BERTScore (Zhang et al., 2020a) as a similarity metric and set the threshold to sim(s1, s2) ≥ 0.7 [6]. The clinical model of Stanza (Zhang et al., 2020c) is used to extract anatomy entities and observation entities. s1 and s2 are conditioned to be both negated or both non-negated. The negation is determined with a negation identifier or the existence of an uncertain entity, using NegBio (Peng et al., 2018) as the negation identifier; the clinical model of Stanza is used to extract uncertain entities. s2 is further restricted to include at least 2 entities as |NE(s2)| ≥ 2. This similarity metric, named entity recognition model, and entity number restriction are used in the latter neutral and contradiction rules. The negation restriction is used in the neutral rules but is not used in the contradiction rule. The following is an example of a sentence pair that matches E1 with entities in bold:

s1: The heart is mildly enlarged.
s2: The heart appears again mild-to-moderately enlarged.

[6] distilbert-base-uncased with the baseline score is used as the model of BERTScore for a fast comparison and a smooth score scale. We swept the threshold value from {0.6, 0.7, 0.8, 0.9} and set it to 0.7 as a relaxed boundary to balance between accuracy and diversity.

Neutral Rule 1: N1
1. s1 and s2 are semantically similar.
2. The named entities of s1 are a proper subset of the named entities of s2 as NE(s1) ⊊ NE(s2).

Since s1 is a premise, this condition denotes that the counterpart hypothesis has entities that are not included in the premise. The following is an example of a sentence pair that matches N1 with entities in bold:

s1: There is no pulmonary edema or definite consolidation.
s2: There is no focal consolidation, pleural effusion, or pulmonary edema.

Neutral Rule 2: N2
1. The named entities of s1 are equal to the named entities of s2 as NE(s1) = NE(s2).
2. The anatomy modifiers (NEmod) of s1 include an antonym (ANT) of the anatomy modifiers of s2 as NEmod(s1) ∩ ANT(NEmod(s2)) ≠ ∅.

Anatomy modifiers are extracted with the clinical model of Stanza and antonyms are decided using WordNet (Fellbaum, 1998). Antonyms in anatomy modifiers are considered in this rule to differentiate expressions like left vs. right and upper vs. lower. The following is an example of a sentence pair that matches N2 with antonyms in bold:

s1: Moreover, a small left pleural effusion has newly occurred.
s2: Small right pleural effusion has worsened.

Neutral Rule 3: N3
1. The named entity types (NEtype) of s1 are equal to the named entity types of s2 as NEtype(s1) = NEtype(s2).
2. The named entities of s1 are different from the named entities of s2 as NE(s1) ∩ NE(s2) = ∅.

Specific entity types that we used are anatomy and observation. This rule ensures that s1 and s2 have related but different entities of the same types. The following is an example of a sentence pair that matches N3 with entities in bold:

s1: There is minimal bilateral lower lobe atelectasis.
s2: The cardiac silhouette is moderately enlarged.

Neutral Rule 4: N4
1. The named entities of s1 are equal to the named entities of s2 as NE(s1) = NE(s2).
2. s1 and s2 include observation keywords (KEY) that belong to different groups as KEY(s1) ≠ KEY(s2).

The groups of observation keywords are set up following the observation keywords of the CheXpert labeler (Irvin et al., 2019). Specifically, G1 = {normal, unremarkable}, G2 = {stable, unchanged}, and G3 = {clear} are used to determine words included in different groups as a neutral relation. The following is an example of a sentence pair that matches N4 with keywords in bold:

s1: Normal cardiomediastinal silhouette.
s2: Cardiomediastinal silhouette is unchanged.

Contradiction Rule: C1
1. The named entities of s2 are a subset of or equal to the named entities of s1 as NE(s2) ⊆ NE(s1).
2. s1 or s2 is a negated sentence.

Negation is determined with the same approach as E1. The following is an example of a sentence pair that matches C1 with entities in bold:

s1: There are also small bilateral pleural effusions.
s2: No pleural effusions.

A.2 Validation and Test Datasets of Radiology NLI

We sampled 480 sentence pairs that satisfy the following conditions from the validation section of MIMIC-CXR:

1. Two sentences (s1 and s2) have BERTScore(s1, s2) ≥ 0.5.
2. MedNLI labels are equally distributed over three labels: entailment, neutral, and contradiction [7].

These conditions are introduced to reduce neutral pairs since most pairs will be neutral with random sampling. The sampled pairs are annotated twice, swapping premise and hypothesis, by two experts: one medical expert and one NLP expert. For pairs that the two annotators disagreed on, the labels are decided by a discussion with one additional NLP expert. The resulting 960 bidirectional pairs are split in half, resulting in 480 pairs for a validation set and 480 pairs for a test set.

[7] We used the baseline BERT model of Wu et al. (2019) to assign MedNLI labels to the pairs.

A.3 Configuration of Radiology NLI Model

We used bert-base-uncased as a pre-trained BERT model and further fine-tuned it on MIMIC-III (Johnson et al., 2016) radiology reports with a masked language modeling loss for 8 epochs. The model is further optimized on the training data with a classification negative log likelihood loss. We used Adam (Kingma and Ba, 2015) as an optimization method with β1 = 0.9, β2 = 0.999, a batch size of 16, and a gradient clipping norm of 5.0. The learning rate is set to lr = 1e−5 by running a preliminary experiment with lr = {1e−5, 2e−5}. The model is optimized for a maximum of 20 epochs and the validation accuracy is used to decide the model checkpoint that is used to evaluate the test set. We trained the model with a single Nvidia Titan XP, taking approximately 2 hours to complete 20 epochs.

B Configurations of Radiology Report Generation Models

B.1 M2 Trans

We used DenseNet-121 (Huang et al., 2017) as a CNN image feature extractor and pre-trained it on the CheXpert dataset with the 14-class classification setting. We used GloVe (Pennington et al., 2014) to pre-train text embeddings and the pre-trainings were done on a training set with an embedding size of 512. The parameters of the model are set to a dimensionality of 512, 8 heads, and 40 memory vectors. We set the number of Transformer layers to nlayer = 1 by running a preliminary experiment with nlayer = {1, 2, 3}. The model is first trained against NLL loss using the learning rate scheduler of Transformer (Devlin et al., 2019) with warm-up steps of 20000 and is further optimized with a joint loss with a fixed learning rate of 5e−6. Adam is used as an optimization method with β1 = 0.9 and β2 = 0.999. The batch size is set to 48 for NLL loss and 24 for the joint losses. For λ∗, we first swept the optimal value of λ1 from {0.03, 0.02, 0.01, 0.001} using the development set. We restricted λ2 and λ3 to have equal values in our experiments and constrained all λ∗ values to sum up to 1.0. The model is trained with NLL loss for 32 epochs and further trained for 32 epochs with a joint loss. Beam search with a beam size of 4 is used to decode texts when evaluating the model against a validation set or a test set. We trained the model with a single Nvidia