Question Detection in Spoken Conversations Using Textual Conversations
Anna Margolis and Mari Ostendorf
Department of Electrical Engineering, University of Washington, Seattle, WA, USA
{amargoli,mo}@ee.washington.edu

Abstract

We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manually-labeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.

1 Introduction

Automatic speech recognition systems, which transcribe words, are often augmented by subsequent processing for inserting punctuation or labeling speech acts. Both prosodic features (extracted from the acoustic signal) and lexical features (extracted from the word sequence) have been shown to be useful for these tasks (Shriberg et al., 1998; Kim and Woodland, 2003; Ang et al., 2005). However, access to labeled speech training data is generally required in order to use prosodic features. On the other hand, the Internet contains large quantities of textual data that is already labeled with punctuation, and which can be used to train a system using lexical features. In this work, we focus on question detection in the Meeting Recorder Dialog Act corpus (MRDA) (Shriberg et al., 2004), using text sentences with question marks in Wikipedia "talk" pages. We compare the performance of a question detector trained on the text domain using lexical features with one trained on MRDA using lexical features and/or prosodic features. In addition, we experiment with two unsupervised domain adaptation methods to incorporate unlabeled MRDA utterances into the text-based question detector. The goal is to use the unlabeled domain-matched data to bridge stylistic differences as well as to incorporate the prosodic features, which are unavailable in the labeled text data.

2 Related Work

Question detection can be viewed as a subtask of speech act or dialogue act tagging, which aims to label functions of utterances in conversations, with categories such as question/statement/backchannel, or more specific categories such as request or command (e.g., Core and Allen (1997)). Previous work has investigated the utility of various feature types; Boakye et al. (2009), Shriberg et al. (1998) and Stolcke et al. (2000) showed that prosodic features were useful for question detection in English conversational speech, but (at least in the absence of recognition errors) most of the performance was achieved with words alone. There has been some previous investigation of domain adaptation for dialogue act classification, including adaptation between different speech corpora (MRDA and Switchboard) (Guz et al., 2010), between speech corpora in different languages (Margolis et al., 2010), and from a speech domain (MRDA/Switchboard) to text domains (emails and forums) (Jeong et al., 2009).
These works did not use prosodic features, although Venkataraman et al. (2003) included prosodic features in a semi-supervised learning approach for dialogue act labeling within a single spoken domain. Also relevant is the work of Moniz et al. (2011), who compared question types in different Portuguese corpora, including text and speech. For question detection on speech, they compared performance of a lexical model trained with newspaper text to models trained with speech including acoustic and prosodic features, where the speech-trained model also utilized the text-based model predictions as a feature. They reported that the lexical model mainly identified wh questions, while the speech data helped identify yes-no and tag questions, although results for specific categories were not included.

Question detection is related to the task of automatic punctuation annotation, for which the contributions of lexical and prosodic features have been explored in other works, e.g. Christensen et al. (2001) and Huang and Zweig (2002). Kim and Woodland (2003) and Liu et al. (2006) used auxiliary text corpora to train lexical models for punctuation annotation or sentence segmentation, which were used along with speech-trained prosodic models; the text corpora consisted of broadcast news or telephone conversation transcripts. More recently, Gravano et al. (2009) used lexical models built from web news articles on broadcast news speech, and compared their performance on written news; Shen et al. (2009) trained models on an online encyclopedia, for punctuation annotation of news podcasts. Web text was also used in a domain adaptation strategy for prosodic phrase prediction in news text (Chen et al., 2010).

In our work, we focus on spontaneous conversational speech, and utilize a web text source that is somewhat matched in style: both domains consist of goal-directed multi-party conversations. We focus specifically on question detection in pre-segmented utterances. This differs from punctuation annotation or segmentation, which is usually seen as a sequence tagging or classification task at word boundaries, and uses mostly local features. Our focus also allows us to clearly analyze the performance on different question types, in isolation from segmentation issues. We compare performance of textual- and speech-trained lexical models, and examine the detection accuracy of each question type. Finally, we compare two domain adaptation approaches to utilize unlabeled speech data: bootstrapping, and Blitzer et al.'s Structural Correspondence Learning (SCL) (Blitzer et al., 2006). SCL is a feature-learning method that uses unlabeled data from both domains. Although it has been applied to several NLP tasks, to our knowledge we are the first to apply SCL to both lexical and prosodic features in order to adapt from text to speech.

3 Experiments

3.1 Data

The Wiki talk pages consist of threaded posts by different authors about a particular Wikipedia entry. While these lack certain properties of spontaneous speech (such as backchannels, disfluencies, and interruptions), they are more conversational than news articles, containing utterances such as: "Are you serious?" or "Hey, that's a really good point." We first cleaned the posts (to remove URLs, images, signatures, Wiki markup, and duplicate posts) and then performed automatic segmentation of the posts into sentences using MXTERMINATOR (Reynar and Ratnaparkhi, 1997). We labeled each sentence ending in a question mark (followed optionally by other punctuation) as a question; we also included parentheticals ending in question marks. All other sentences were labeled as non-questions. We then removed all punctuation and capitalization from the resulting sentences and performed some additional text normalization to match the MRDA transcripts, such as number and date expansion.
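To make this text-side labeling concrete, here is a minimal Python sketch of the steps just described. It assumes the posts have already been cleaned, substitutes NLTK's sentence splitter for MXTERMINATOR, and omits the number and date expansion; it is an illustration of the procedure, not the authors' code.

```python
import re
import nltk  # assumes the punkt models are available: nltk.download("punkt")

def label_wiki_sentences(post_text):
    """Split a cleaned talk-page post into sentences and label each one as a
    question (ends in '?', possibly followed by other punctuation) or a
    non-question, then strip punctuation and casing."""
    examples = []
    for sent in nltk.sent_tokenize(post_text):
        sent = sent.strip()
        if not sent:
            continue
        # question if the sentence ends in '?', optionally followed by other
        # punctuation or a closing quote/bracket
        is_question = bool(re.search(r"\?[\"')\].,!]*$", sent))
        # remove punctuation and capitalization to match the MRDA transcripts
        # (the full pipeline also expands numbers and dates)
        tokens = re.findall(r"[a-z']+", sent.lower())
        if tokens:
            examples.append((" ".join(tokens), int(is_question)))
    return examples

print(label_wiki_sentences("Hey, that's a really good point. Are you serious?"))
# [("hey that's a really good point", 0), ('are you serious', 1)]
```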
For the MRDA corpus, we use the manually-transcribed sentences with utterance time alignments. The corpus has been hand-annotated with detailed dialogue act tags, using a hierarchical labeling scheme in which each utterance receives one "general" label plus a variable number of "specific" labels (Dhillon et al., 2004). In this work we are only looking at the problem of discriminating questions from non-questions; we consider as questions all complete utterances labeled with one of the general labels wh, yes-no, open-ended, or, or-after-yes-no, or rhetorical question. (To derive the question categories below, we also consider the specific labels tag and declarative, which are appended to one of the general labels.) All remaining utterances, including backchannels and incomplete questions, are considered as non-questions, although we removed utterances that are very short (less than 200ms), have no transcribed words, or are missing segmentation times or a dialogue act label. We performed minor text normalization on the transcriptions, such as mapping all word fragments to a single token.
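The corresponding label derivation on the speech side can be sketched as follows. The utterance representation and field names here are hypothetical (the MRDA release has its own file formats and tag codes); only the filtering thresholds and the question/non-question mapping follow the description above.

```python
QUESTION_LABELS = {"wh", "yes-no", "open-ended", "or", "or-after-yes-no", "rhetorical"}
# the general labels treated as questions in the text above, written out in
# plain words here rather than the corpus's internal tag codes

def mrda_question_label(utt):
    """Map one utterance (a dict with hypothetical fields) to 1 (question),
    0 (non-question), or None if it should be discarded."""
    if utt.get("start") is None or utt.get("end") is None:
        return None                          # missing segmentation times
    if not utt.get("words") or not utt.get("general_label"):
        return None                          # no transcript or no dialogue act label
    if utt["end"] - utt["start"] < 0.2:
        return None                          # shorter than 200 ms
    return int(utt["general_label"] in QUESTION_LABELS)

example = {"start": 12.1, "end": 13.4, "general_label": "open-ended",
           "words": "do we have anything else to say about transcription"}
print(mrda_question_label(example))  # 1
```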
The Wiki training set consists of close to 46k utterances, with 8.0% questions. We derived an MRDA training set of the same size from the training division of the original corpus; it consists of 6.6% questions. For the adaptation experiments, we used the full MRDA training set of 72k utterances as unlabeled adaptation data. We used two meetings (3k utterances) from the original MRDA development set for model selection and parameter tuning. The remaining meetings (in the original development and test divisions; 26k utterances) were used as our test set.

3.2 Features and Classifier

Lexical features consisted of unigrams through trigrams including start- and end-utterance tags, represented as binary features (presence/absence), plus a total-number-of-words feature. All ngram features were required to occur at least twice in the training set. The MRDA training set contained on the order of 65k ngram features while the Wiki training set contained over 205k. Although some previous work has used part-of-speech or parse features in related tasks, Boakye et al. (2009) showed no clear benefit of these features for question detection on MRDA beyond the ngram features.

We extracted 16 prosody features from the speech waveforms defined by the given utterance times, using stylized F0 contours computed based on Sönmez et al. (1998) and Lei (2006). The features are designed to be useful for detecting questions and are similar or identical to some of those in Boakye et al. (2009) or Shriberg et al. (1998). They include: F0 statistics (mean, stdev, max, min) computed over the whole utterance and over the last 200ms; slopes computed from a linear regression to the F0 contour (over the whole utterance and last 200ms); initial and final slope values output from the stylizer; initial intercept value from the whole utterance linear regression; ratio of mean F0 in the last 400-200ms to that in the last 200ms; number of voiced frames; and number of words per frame. All 16 features were z-normalized using speaker-level parameters, or gender-level parameters if the speaker had less than 10 utterances.
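As an illustration, the sketch below computes most of these statistics from a stylized F0 contour. It assumes the contour is a NumPy array sampled at a fixed 10 ms frame step with unvoiced frames marked as NaN (an assumption of this sketch, not a detail from the paper), and it omits the stylizer-specific initial/final slope outputs and the speaker-level z-normalization.

```python
import numpy as np

FRAME_SEC = 0.01  # assumed 10 ms frame step

def f0_stats(f0, n_words):
    """Prosodic features from a stylized F0 contour (NaN = unvoiced frame).
    Returns a simplified subset of the 16 features described in the text."""
    feats = {}
    voiced = f0[~np.isnan(f0)]
    last = f0[-int(0.2 / FRAME_SEC):]                        # last 200 ms
    prev = f0[-int(0.4 / FRAME_SEC):-int(0.2 / FRAME_SEC)]   # last 400-200 ms
    for name, seg in [("utt", f0), ("last200", last)]:
        v = seg[~np.isnan(seg)]
        feats[f"f0_mean_{name}"] = v.mean() if v.size else 0.0
        feats[f"f0_std_{name}"] = v.std() if v.size else 0.0
        feats[f"f0_max_{name}"] = v.max() if v.size else 0.0
        feats[f"f0_min_{name}"] = v.min() if v.size else 0.0
        # slope (and, for the whole utterance, intercept) of a linear fit to F0
        t = np.arange(len(seg)) * FRAME_SEC
        ok = ~np.isnan(seg)
        slope, intercept = (np.polyfit(t[ok], seg[ok], 1)
                            if ok.sum() >= 2 else (0.0, 0.0))
        feats[f"f0_slope_{name}"] = slope
        if name == "utt":
            feats["f0_intercept_utt"] = intercept
    p, l = prev[~np.isnan(prev)], last[~np.isnan(last)]
    feats["f0_ratio_400_200"] = (p.mean() / l.mean()) if p.size and l.size else 0.0
    feats["n_voiced_frames"] = float(voiced.size)
    feats["words_per_frame"] = n_words / len(f0) if len(f0) else 0.0
    return feats

# toy contour: 300 ms unvoiced, then a 600 ms rising stylized F0 segment
f0 = np.concatenate([np.full(30, np.nan), np.linspace(180, 220, 60)])
print(len(f0_stats(f0, n_words=5)))  # 14 features in this simplified version
```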
For all experiments we used logistic regression models trained with the LIBLINEAR package (Fan et al., 2008). Prosodic and lexical features were combined by concatenation into a single feature vector; the prosodic features and the number-of-words feature were z-normalized to place them roughly on the same scale as the binary ngram features. (We substituted 0 for missing prosody features due to, e.g., no voiced frames detected, segmentation errors, or an utterance that was too short.) Our setup is similar to that of Surendran and Levow (2006), who combined ngram and prosodic features for dialogue act classification using a linear SVM. Since ours is a detection problem, with questions much less frequent than non-questions, we present results in terms of ROC curves, which were computed from the probability scores of the classifier. The cost parameter C was tuned to optimize Area Under the Curve (AUC) on the development set (C = 0.01 for prosodic features only and C = 0.1 in all other cases).
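A rough equivalent of this setup, using scikit-learn's LogisticRegression (which wraps LIBLINEAR) on toy data, is sketched below; the utterances, the random prosody matrix, and the variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

np.random.seed(0)
# toy data: pre-normalized utterance texts and per-utterance prosodic vectors
texts = ["are you serious", "that is a really good point", "you know",
         "did you do that", "we should move on"]
labels = np.array([1, 0, 1, 1, 0])
prosody = np.random.randn(len(texts), 16)   # stand-in for the 16 F0 features

# binary unigram-trigram features with start/end tags, minimum count 2
vec = CountVectorizer(ngram_range=(1, 3), binary=True, min_df=2,
                      preprocessor=lambda s: f"<s> {s} </s>",
                      token_pattern=r"\S+")
X_lex = vec.fit_transform(texts)
n_words = np.array([[len(t.split())] for t in texts], dtype=float)

# z-normalize the dense features so they sit roughly on the ngram scale
dense = np.hstack([prosody, n_words])
dense = (dense - dense.mean(axis=0)) / (dense.std(axis=0) + 1e-8)
X = hstack([X_lex, csr_matrix(dense)]).tocsr()

clf = LogisticRegression(solver="liblinear", C=0.1)  # C tuned on a dev set in the paper
clf.fit(X, labels)

scores = clf.predict_proba(X)[:, 1]
print("AUC:", roc_auc_score(labels, scores))
# operating point: detection rate at a fixed false positive rate (e.g. 16.7%)
fpr, tpr, _ = roc_curve(labels, scores)
print("detection rate at FPR <= 0.167:", tpr[fpr <= 0.167].max())
```

In the paper the probability scores are computed on held-out meetings; here the toy AUC is computed on the training utterances purely to show the API.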
3.3 Baseline Results

Figure 1 shows the ROC curves for the baseline Wiki-trained lexical system and the MRDA-trained systems with different feature sets. Table 2 compares performance across different question categories at a fixed false positive rate (16.7%) near the equal error rate of the MRDA (lex) case. For analysis purposes we defined the categories in Table 2 as follows: tag includes any yes-no question given the additional tag label; declarative includes any question category given the declarative label that is not a tag question; the remaining categories (yes-no, or, etc.) include utterances in those categories but not included in declarative or tag. Table 1 gives example sentences for each category.

  yes-no        did did you do that?
  declarative   you're not going to be around this afternoon?
  wh            what do you mean um reference frames?
  tag           you know?
  rhetorical    why why don't we do that?
  open-ended    do we have anything else to say about transcription?
  or            and @frag@ did they use sigmoid or a softmax type thing?
  or-after-YN   or should i collect it all?

Table 1: Examples for each MRDA question category as defined in this paper, based on Dhillon et al. (2004).

  type (count)        MRDA (L+P)   MRDA (L)   MRDA (P)   Wiki (L)
  yes-no (526)          89.4*        86.1       59.3       77.2
  declar. (417)         69.8*        59.2       49.4       25.9
  wh (415)              95.4*        93.0       42.2       92.8
  tag (358)             89.7         90.5*      26.0       79.1
  rhetorical (75)       88.0         90.7       25.3       93.3*
  open-ended (50)       88.0         92.0*      16.0       80.0
  or (38)               97.4        100*        29.0       89.5
  or-after-YN (32)      96.9*        96.9*      25.0       90.6

Table 2: Question detection rates (%) by question type for each system (L = lexical features, P = prosodic features). Detection rates are given at a false positive rate of 16.7% (starred points in Figure 1), which is the equal error rate point for the MRDA (L) system. An asterisk marks the best result for each type.

[Figure 1 (plot not reproduced): ROC curves with AUC values for question detection on MRDA; comparison between systems trained on MRDA using lexical and/or prosodic features, and Wiki talk pages using lexical features. AUC: 0.925 for MRDA (lex+pros), 0.912 for MRDA (lex only), 0.833 for Wiki (lex only), 0.696 for MRDA (pros only).]

As expected, the Wiki-trained system does worst on declarative, which have the syntactic form of statements. For the MRDA-trained system, prosody alone does best on yes-no and declarative. Along with lexical features, prosody is more useful for declarative, while it appears to be somewhat redundant with lexical features for yes-no. Ideally, such redundancy can be used together with unlabeled spoken utterances to incorporate prosodic features into the Wiki system, which may improve detection of some kinds of questions.

3.4 Adaptation Results

For bootstrapping, we first train an initial baseline classifier using the Wiki training data, then use it to label MRDA data from the unlabeled adaptation set. We select the k most confident examples for each of the two classes and add them to the training set using the guessed labels, then retrain the classifier using the new training set. This is repeated for r rounds. In order to use prosodic features, which are available only in the bootstrapped MRDA data, we simply add 16 zeros onto the Wiki examples in place of the missing prosodic features. The values k = 20 and r = 6 were selected on the dev set.
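In schematic Python, the bootstrapping loop looks roughly like this; train_fn and score_fn stand for the feature extraction and LIBLINEAR training of Section 3.2, and whether selected examples are removed from the pool is a choice of this sketch, since the paper does not say.

```python
import numpy as np

def bootstrap(train_fn, score_fn, X_wiki, y_wiki, X_mrda_unlab, k=20, r=6):
    """Self-training from labeled Wiki data plus unlabeled MRDA data.

    train_fn(X, y)   -> a trained classifier
    score_fn(clf, X) -> P(question) for each row of X
    X_wiki           -> Wiki features; the 16 prosody slots are zero, e.g.
                        X_wiki = np.hstack([X_wiki_lex, np.zeros((n, 16))])
    X_mrda_unlab     -> unlabeled MRDA features (lexical + prosodic)
    """
    X_train, y_train = X_wiki.copy(), y_wiki.copy()
    pool = X_mrda_unlab.copy()
    clf = train_fn(X_train, y_train)
    for _ in range(r):
        p = score_fn(clf, pool)
        # the k most confident examples for each class, with guessed labels
        top_q = np.argsort(-p)[:k]
        top_n = np.argsort(p)[:k]
        chosen = np.concatenate([top_q, top_n])
        guessed = np.concatenate([np.ones(k, dtype=int), np.zeros(k, dtype=int)])
        X_train = np.vstack([X_train, pool[chosen]])
        y_train = np.concatenate([y_train, guessed])
        # removing transferred examples from the pool is a choice of this
        # sketch; the paper does not say whether they can be re-selected
        pool = np.delete(pool, chosen, axis=0)
        clf = train_fn(X_train, y_train)  # retrain on the augmented set
    return clf
```

With the selected values k = 20 and r = 6, at most 240 MRDA utterances receive guessed labels, so the roughly 46k Wiki utterances still dominate the final training set.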
In contrast with bootstrapping, SCL (Blitzer et al., 2006) uses the unlabeled target data to learn domain-independent features. SCL has generated much interest lately because of the ability to incorporate features not seen in the training data. The main idea is to use unlabeled data in both domains to learn linear predictors for many "auxiliary" tasks, which should be somewhat related to the task of interest. In particular, if x is a row vector representing the original feature vector and y_i represents the label for auxiliary task i, the linear predictor w_i is learned to predict ŷ_i = w_i · x′ (where x′ is a modified version of x that excludes any features completely predictive of y_i). The learned predictors for all tasks {w_i} are then collected into the columns of a matrix W, on which a singular value decomposition U S V^T = W is performed. Ideally, features that behave similarly across many y_i will be represented in the same singular vector; thus, the auxiliary tasks can tie together features which may never occur together in the same example. Projection of the original feature vector onto the top h left singular vectors gives an h-dimensional feature vector z ≡ U_{1:h}^T · x′. The model is then trained on the concatenated feature representation [x, z] using the labeled source data.

As auxiliary tasks y_i, we identify all initial words that begin an utterance at least 5 times in each domain's training set, and predict the presence of each initial word (y_i = 0 or 1). The idea of using the initial words is that they may be related to the interrogative status of an utterance: utterances starting with "do" or "what" are more often questions, while those starting with "i" are usually not. There were about 250 auxiliary tasks. The prediction features x′ used in SCL include all ngrams occurring at least 5 times in the unlabeled Wiki or MRDA data, except those over the first word, as well as prosody features (which are zero in the Wiki data). We tuned h = 100 and the scale factor of z (to 1) on the dev set.
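A compact sketch of this construction is shown below, using dense NumPy arrays and plain least-squares auxiliary predictors for brevity (the paper does not specify how the w_i are trained, and the original SCL work uses regularized classifiers instead).

```python
import numpy as np

def scl_projection(X_unlab, aux_labels, keep_mask, h=100):
    """Learn an SCL projection from pooled unlabeled data of both domains.

    X_unlab    : (n, d) feature matrix pooled from Wiki and MRDA
    aux_labels : (n, m) binary matrix; column i is auxiliary task y_i
                 (here: does the utterance start with initial word i)
    keep_mask  : boolean (d,) mask, False for features excluded from x'
                 (the ngrams over the first word)
    Returns U_h whose columns are the top h left singular vectors of W.
    """
    Xp = X_unlab * keep_mask              # x' excludes the predictive features
    # one linear predictor w_i per auxiliary task (least squares for brevity)
    W, *_ = np.linalg.lstsq(Xp, aux_labels, rcond=None)   # (d, m), columns are w_i
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h]

def scl_features(X, U_h, keep_mask, scale=1.0):
    """Concatenate the original features x with the SCL features z = U_h^T x'."""
    z = (X * keep_mask) @ U_h
    return np.hstack([X, scale * z])
```

The projection U_h is learned once from the pooled unlabeled Wiki and MRDA data and then applied to both the labeled Wiki training examples and the MRDA test examples before the logistic regression is trained on [x, z].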
Figure 2 compares the results using the bootstrapping and SCL approaches, and the baseline unadapted Wiki system. Table 3 shows results by question type at the fixed false positive point chosen for analysis. At this point, both adaptation methods improved detection of declarative and yes-no questions, although they decreased detection of several other types. Note that we also experimented with other adaptation approaches on the dev set: bootstrapping without the prosodic features did not lead to an improvement, nor did training on Wiki using "fake" prosody features predicted based on MRDA examples. We also tried a co-training approach using separate prosodic and lexical classifiers, inspired by the work of Guz et al. (2007) on semi-supervised sentence segmentation; this led to a smaller improvement than bootstrapping. Since we tuned and selected adaptation methods on the MRDA dev set, we compare to training with the labeled MRDA dev (with prosodic features) and Wiki data together. This gives superior results compared to adaptation; but note that the adaptation process did not use labeled MRDA data to train, but merely for model selection. Analysis of the adapted systems suggests prosody features are being utilized to improve performance in both methods, but clearly the effect is small, and the need to tune parameters would present a challenge if no labeled speech data were available. Finally, while the benefit from 3k labeled MRDA utterances added to the Wiki utterances is encouraging, we found that most of the MRDA training utterances (with prosodic features) had to be added to match the MRDA-only result in Figure 1, although perhaps training separate lexical and prosodic models would be useful in this respect.

  type (count)        baseline   bootstrap   SCL
  yes-no (526)          77.2      81.4 (+)   83.5 (+)
  declar. (417)         25.9      30.5 (+)   32.1 (+)
  wh (415)              92.8      92.8       93.5 (+)
  tag (358)             79.1      79.3 (+)   80.7 (+)
  rhetorical (75)       93.3      88.0 (-)   92.0 (-)
  open-ended (50)       80.0      76.0 (-)   80.0
  or (38)               89.5      89.5       89.5
  or-after-YN (32)      90.6      90.6       90.6

Table 3: Adaptation performance by question type, at a false positive rate of 16.7% (starred points in Figure 2). A (+) marks an adaptation result better than the baseline; a (-) marks one worse than the baseline.

[Figure 2 (plot not reproduced): ROC curves and AUC values for adaptation, baseline Wiki, and Wiki + MRDA dev. AUC values: 0.884 (Wiki + MRDA dev), 0.859 and 0.850 (the two adapted systems), 0.833 (unadapted Wiki baseline).]

4 Conclusion

This work explored the use of conversational web text to detect questions in conversational speech. We found that the web text does especially poorly on declarative questions, which can potentially be improved using prosodic features. Unsupervised adaptation methods utilizing unlabeled speech and a small labeled development set are shown to improve performance slightly, although training with the small development set leads to bigger gains. Our work suggests approaches for combining large amounts of "naturally" annotated web text with unannotated speech data, which could be useful in other spoken language processing tasks, e.g. sentence segmentation or emphasis detection.
References

Jeremy Ang, Yang Liu, and Elizabeth Shriberg. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In Proc. Int. Conference on Acoustics, Speech, and Signal Processing.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128, Sydney, Australia, July. Association for Computational Linguistics.

Kofi Boakye, Benoit Favre, and Dilek Hakkani-Tür. 2009. Any questions? Automatic question detection in meetings. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding.

Zhigang Chen, Guoping Hu, and Wei Jiang. 2010. Improving prosodic phrase prediction by unsupervised adaptation and syntactic features extraction. In Proc. Interspeech.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001. Punctuation annotation using statistical prosody models. In Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding, pages 35–40.

Mark G. Core and James F. Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In Proc. of the Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, Cambridge, MA, November.

Rajdip Dhillon, Sonali Bhagat, Hannah Carvey, and Elizabeth Shriberg. 2004. Meeting recorder project: Dialog act labeling guide. Technical report, ICSI.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, August.

Agustin Gravano, Martin Jansche, and Michiel Bacchiani. 2009. Restoring punctuation and capitalization in transcribed speech. In Proc. Int. Conference on Acoustics, Speech, and Signal Processing.

Umit Guz, Sébastien Cuendet, Dilek Hakkani-Tür, and Gokhan Tur. 2007. Co-training using prosodic and lexical information for sentence segmentation. In Proc. Interspeech.

Umit Guz, Gokhan Tur, Dilek Hakkani-Tür, and Sébastien Cuendet. 2010. Cascaded model adaptation for dialog act segmentation and tagging. Computer Speech & Language, 24(2):289–306, April.

Jing Huang and Geoffrey Zweig. 2002. Maximum entropy model for punctuation annotation from speech. In Proc. Int. Conference on Spoken Language Processing, pages 917–920.

Minwoo Jeong, Chin-Yew Lin, and Gary G. Lee. 2009. Semi-supervised speech act recognition in emails and forums. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1250–1259, Singapore, August. Association for Computational Linguistics.

Ji-Hwan Kim and Philip C. Woodland. 2003. A combined punctuation generation and speech recognition system and its performance enhancement using prosody. Speech Communication, 41(4):563–577, November.

Xin Lei. 2006. Modeling lexical tones for Mandarin large vocabulary continuous speech recognition. Ph.D. thesis, Department of Electrical Engineering, University of Washington.

Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio, Speech, and Language Processing, 14(5):1526–1540, September.

Anna Margolis, Karen Livescu, and Mari Ostendorf. 2010. Domain adaptation with unlabeled data for dialog act tagging. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 45–52, Uppsala, Sweden, July. Association for Computational Linguistics.

Helena Moniz, Fernando Batista, Isabel Trancoso, and Ana Mata. 2011. Analysis of interrogatives in different domains. In Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues, volume 6456 of Lecture Notes in Computer Science, chapter 12, pages 134–146. Springer Berlin / Heidelberg.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conf. on Applied Natural Language Processing, April.

Wenzhu Shen, Roger P. Yu, Frank Seide, and Ji Wu. 2009. Automatic punctuation generation for speech. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pages 586–589, December.

Elizabeth Shriberg, Rebecca Bates, Andreas Stolcke, Paul Taylor, Daniel Jurafsky, Klaus Ries, Noah Coccaro, Rachel Martin, Marie Meteer, and Carol Van Ess-Dykema. 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech (Special Double Issue on Prosody and Conversation), 41(3-4):439–487.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proc. of the 5th SIGdial Workshop on Discourse and Dialogue, pages 97–100.
Kemal Sönmez, Elizabeth Shriberg, Larry Heck, and Mitchel Weintraub. 1998. Modeling dynamic prosodic variation for speaker verification. In Proc. Int. Conference on Spoken Language Processing, pages 3189–3192.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26:339–373.

Dinoj Surendran and Gina-Anne Levow. 2006. Dialog act tagging with support vector machines and hidden Markov models. In Proc. Interspeech, pages 1950–1953.

Anand Venkataraman, Luciana Ferrer, Andreas Stolcke, and Elizabeth Shriberg. 2003. Training a prosody-based dialog act tagger from unlabeled data. In Proc. Int. Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 272–275, April.