Active Learning for New Domains in Natural Language Understanding
Stanislav Peshterliev, John Kearney, Abhyuday Jagannatha, Imre Kiss, Spyros Matsoukas
Alexa Machine Learning, Amazon.com
{stanislp,jkearn,abhyudj,ikiss,matsouka}@amazon.com

Proceedings of NAACL-HLT 2019, pages 90–96, Minneapolis, Minnesota, June 2 – June 7, 2019. © 2019 Association for Computational Linguistics

Abstract

We explore active learning (AL) for improving the accuracy of new domains in a natural language understanding (NLU) system. We propose an algorithm called Majority-CRF that uses an ensemble of classification models to guide the selection of relevant utterances, as well as a sequence labeling model to help prioritize informative examples. Experiments with three domains show that Majority-CRF achieves 6.6%-9% relative error rate reduction compared to random sampling with the same annotation budget, and statistically significant improvements compared to other AL approaches. Additionally, case studies with human-in-the-loop AL on six new domains show 4.6%-9% improvement on an existing NLU system.

1 Introduction

Intelligent voice assistants (IVA) such as Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana are becoming increasingly popular. For IVA, natural language understanding (NLU) is a main component (De Mori et al., 2008), in conjunction with automatic speech recognition (ASR) and dialog management (DM). ASR converts the user's speech to text. Then, the text is passed to NLU for classifying the action or "intent" that the user wants to invoke (e.g., PlayMusicIntent, TurnOnIntent, BuyItemIntent) and recognizing named entities (e.g., Artist, Genre, City). Based on the NLU output, DM decides the appropriate response, which could be starting a song playback or turning off lights. NLU systems for IVA support functionality in a wide range of domains, such as music, weather, and traffic. Also, an important requirement is the ability to add support for new domains.

The NLU models for Intent Classification (IC) and Named Entity Recognition (NER) use machine learning to recognize variation in natural language. Diverse, annotated training data collected from IVA users, or "annotated live utterances," are essential for these models to achieve good performance. As such, new domains frequently exhibit suboptimal performance due to a lack of annotated live utterances. While an initial training dataset can be bootstrapped using grammar generated utterances and crowdsourced collection (Amazon Mechanical Turk), the performance that can be achieved using these approaches is limited because of the unexpected discrepancies between anticipated and live usage. Thus, a mechanism is required to select live utterances to be manually annotated for enriching the training dataset.

Random sampling is a common method for selecting live utterances for annotation. However, in an IVA setting with many users, the number of available live utterances is vast. Meanwhile, due to the high cost of manual annotation, only a small percentage of utterances can be annotated. As such, in a random sample of live data, the number of utterances relevant to new domains may be small. Moreover, those utterances may not be informative, where informative utterances are those that, if annotated and added to the training data, reduce the error rates of the NLU system. Thus, for new domains, we want a sampling procedure which selects utterances that are both relevant and informative.

Active learning (AL) (Settles, 2009) refers to machine learning methods that can interact with the sampling procedure and guide the selection of data for annotation. In this work, we explore using AL for live utterance selection for new domains in NLU. Authors have successfully applied AL techniques to NLU systems with little annotated data overall (Tur et al., 2003; Shen et al., 2004). The difference with our work is that, to the best of our knowledge, there is little published AL research that focuses on data selection explicitly targeting new domains.
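To make the IC and NER terminology above concrete, here is a small illustrative example of the kind of structured interpretation an NLU system produces for one utterance. The intent and slot names mirror the examples given earlier and are purely illustrative, not the actual system's schema.

```python
# Illustrative only: a hypothetical NLU interpretation for a single utterance.
utterance = "play jazz by miles davis"

nlu_hypothesis = {
    "domain": "Music",                      # domain classification
    "intent": "PlayMusicIntent",            # intent classification (IC)
    "slots": [                              # named entity recognition (NER)
        {"name": "Genre", "value": "jazz"},
        {"name": "Artist", "value": "miles davis"},
    ],
}
# Based on such an interpretation, the dialog manager would start music playback.
```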
We compare the efficacy of least-confidence (Lewis and Catlett, 1994) and query-by-committee (Freund et al., 1997) AL for new domains. Moreover, we propose an AL algorithm called Majority-CRF, designed to improve both IC and NER of an NLU system. Majority-CRF uses an ensemble of classification models to guide the selection of relevant utterances, as well as a sequence labeling model to help prioritize informative examples. Simulation experiments on three different new domains show that Majority-CRF achieves 6.6%-9% relative improvements in-domain compared to random sampling, as well as significant improvements compared to other active learning approaches.

2 Related Work

Expected model change (Settles et al., 2008) and expected error reduction (Roy and McCallum, 2001) are AL approaches based on decision theory. Expected model change tries to select utterances that cause the greatest change on the model. Similarly, expected error reduction tries to select utterances that are going to maximally reduce generalization error. Both methods provide sophisticated ways of ascertaining the value of annotating an utterance. However, they require computing an expectation across all possible ways to label the utterance, which is computationally expensive for NER and IC models with many labels and millions of parameters. Instead, approaches to AL for NLU generally require finding a proxy, such as model uncertainty, to estimate the value of getting specific points annotated.

Tur et al. studied least-confidence and query-by-committee disagreement AL approaches for reducing the annotation effort (Tur et al., 2005, 2003). Both performed better than random sampling, and the authors concluded that the overall annotation effort could be halved. We investigate both of these approaches, but also a variety of new algorithms that build upon these basic ideas.

Schütze et al. (2006) showed that AL is susceptible to the missed cluster effect when selection focuses only on low-confidence examples around the existing decision boundary, missing important clusters of data that receive high confidence. They conclude that AL may produce a suboptimal classifier compared to random sampling with a large budget. To solve this problem, Osugi et al. (2005) proposed an AL algorithm that can balance exploitation (sampling around the decision boundary) and exploration (random sampling) by reallocating the sampling budget between the two. In our setting, we start with a representative seed dataset, then we iteratively select and annotate small batches of data that are used as feedback in subsequent selections, such that extensive exploration is not required.

To improve AL, Kuo and Goel (2005) proposed to exploit the similarity between instances. Their results show improvements over simple confidence-based selection for data sizes of less than 5,000 utterances. A computational limitation of the approach is that it requires computing the pairwise utterance similarity, an O(N²) operation that is slow for the millions of utterances available in production IVA. However, their approach could potentially be sped up with techniques like locality-sensitive hashing.

3 Active Learning For New Domains

We first discuss random sampling baselines and standard active learning approaches. Then, we describe the Majority-CRF algorithm and the other AL algorithms that we tested.

3.1 Random Sampling Baselines

A common strategy to select live utterances for annotation is random sampling. We consider two baselines: uniform random sampling and domain random sampling.

Uniform random sampling is widespread because it provides unbiased samples of the live utterance distribution. However, the samples contain fewer utterances for new domains because of their low usage frequency. Thus, under a limited annotation budget, accuracy improvements on new domains are limited.

Domain random sampling uses the predicted NLU domain to provide samples of live utterances more relevant to the target domains. However, this approach does not select the most informative utterances.
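The two baselines can be contrasted with a short sketch. This is illustrative only: `predict_domain` stands in for whatever domain classifier is available, and the candidate pool is assumed to fit in memory.

```python
import random

def uniform_random_sample(pool, budget):
    # Uniform random sampling: an unbiased sample of live traffic,
    # but it contains few utterances from low-frequency new domains.
    return random.sample(pool, budget)

def domain_random_sample(pool, budget, target_domain, predict_domain):
    # Domain random sampling: keep only utterances whose predicted NLU
    # domain matches the target domain, then sample uniformly from them.
    candidates = [utt for utt in pool if predict_domain(utt) == target_domain]
    return random.sample(candidates, min(budget, len(candidates)))
```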
3.2 Active Learning Baselines

AL algorithms can select relevant and informative utterances for annotation. Two popular AL approaches are least-confidence and query-by-committee.

Least-confidence (Lewis and Catlett, 1994) involves processing live data with the NLU models and prioritizing selection of the utterances with the least confidence. The intuition is that utterances with low confidence are difficult, and "teaching" the models how they should be labeled is informative. However, a weakness of this method is that out-of-domain or irrelevant utterances are likely to be selected due to low confidence. This weakness can be alleviated by looking at instances with medium confidence, using measures such as the least margin between the top-n hypotheses (Scheffer et al., 2001) or the highest Shannon entropy (Settles and Craven, 2008).

Query-by-committee (QBC) (Freund et al., 1997) uses different classifiers (e.g., SVMs, MaxEnt, Random Forests) that are trained on the existing annotated data. Each classifier is applied independently to every candidate, and the utterances assigned the most diverse labels are prioritized for annotation. One problem with this approach is that, depending on the model and the size of the committee, it can be computationally expensive to apply on large datasets.

3.3 Majority-CRF Algorithm

Majority-CRF is a confidence-based AL algorithm that uses models trained on the available NLU training set but does not rely on predictions from the full NLU system. Its simplicity compared to a full NLU system offers several advantages. First, fast incremental training with the selected annotated data. Second, fast predictions on millions of utterances. Third, the selected data is not biased toward the current NLU models, which makes our approach reusable even if the models change.

Algorithm 1 shows a generic AL procedure that we use to implement Majority-CRF, as well as the other AL algorithms that we tested. We train an ensemble of models on positive data from the target domain of interest (e.g., Books) and negative data that is everything not in the target domain (e.g., Music, Videos). Then, we use the models to filter and prioritize a batch of utterances for annotation. After the batch is annotated, we retrain the models with the new data and repeat the process.

To alleviate the tendency of the least-confidence approaches to select irrelevant data, we add unsupported utterances and sentence fragments to the negative class training data of the AL models. This helps keep noisy utterances on the negative side of the decision boundary, so that they can be eliminated during filtering. Note that, when targeting several domains at a time, we run the selection procedure independently and then deduplicate the utterances before sending them for annotation.

Algorithm 1: Generic AL procedure that selects data for a target domain
  Inputs:
    D ← positive and negative training data
    P ← pool of unannotated live utterances
    i ← number of iterations, m ← mini-batch size
  Parameters:
    {M_k} ← set of selection models
    F ← filtering function
    S ← scoring function
  Procedure:
    1: repeat
    2:   train the selection models {M_k} on D
    3:   ∀ x_i ∈ P, obtain prediction scores y_i^k = M_k(x_i)
    4:   P′ ← {x_i ∈ P : F(y_i^0 .. y_i^k)}
    5:   C ← the m utterances x_i ∈ P′ with the smallest score S(y_i^0 .. y_i^k)
    6:   send C for manual annotation
    7:   after annotation is done, D ← D ∪ C and P ← P \ C
    8: until i iterations are completed

Models. We experimented with n-gram linear binary classifiers trained to minimize different loss functions: M_lg ← logistic, M_hg ← hinge, and M_sq ← squared. Each classifier is trained to distinguish between positive and negative data and learns a different decision boundary. Note that we use the raw unnormalized prediction scores {y_i^lg, y_i^hg, y_i^sq} (no sigmoid applied), which can be interpreted as distances between the utterance x_i and the classifiers' decision boundaries at y = 0. The classifiers are implemented in Vowpal Wabbit (Langford et al., 2007) with {1, 2, 3}-gram features. To directly target the NER task, we used an additional model M_cf ← CRF, trained on the NER labels of the target domain.

Filtering function. We experimented with F^maj ← Σ_k sgn(y^k) > 0, i.e., keep only utterances with a majority of positive predictions from the binary classifiers, and F^dis ← Σ_k sgn(y^k) ∈ {−1, 1}, i.e., keep only utterances for which there is at least one disagreement among the classifiers.

Scoring function. When the set of models {M_k} consists of only binary classifiers, we combine the classifier scores using either the sum of absolutes S^sa ← Σ_k |y_i^k| or the absolute sum S^as ← |Σ_k y_i^k|. S^sa prioritizes utterances where all scores are small (i.e., close to all decision boundaries), and S^as prioritizes utterances where either all scores are small or there is large disagreement between classifiers (e.g., one score is large negative, another is large positive, and the third is small). Both S^sa and S^as can be seen as generalizations of least-confidence to a committee of classifiers. When the set of models {M_k} includes a CRF model M_cf, we compute the score with S^cg ← P_cf(i) × P_lg(i), i.e., the CRF probability P_cf(i) multiplied by the logistic classifier probability P_lg(i) = σ(y_i^lg), where σ is the sigmoid function. Note that we ignore the outputs of the squared and hinge classifiers for scoring, though they are still used for filtering.
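The sketch below shows how the filtering and scoring functions plug into the generic loop of Algorithm 1. It is an illustration, not the production implementation: the selection models are passed in as plain callables (in the paper they are Vowpal Wabbit classifiers plus a CRF), and training and annotation are abstracted behind hypothetical `train_models` and `annotate` functions.

```python
import math

def sign(y):
    return 1 if y > 0 else -1

# Filtering functions over the raw committee scores y^1..y^k.
def majority_filter(scores):
    # F_maj: keep utterances where the majority of binary classifiers
    # predict the positive (in-domain) class.
    return sum(sign(y) for y in scores) > 0

def disagreement_filter(scores):
    # F_dis: keep utterances where at least one classifier disagrees
    # (for a committee of three, the sum of signs is -1 or +1).
    return sum(sign(y) for y in scores) in (-1, 1)

# Scoring functions: smaller score = higher annotation priority.
def sum_of_absolutes(scores):
    # S_sa: small only when every score is close to its decision boundary.
    return sum(abs(y) for y in scores)

def absolute_sum(scores):
    # S_as: small when all scores are small or when they disagree strongly.
    return abs(sum(scores))

def crf_score(y_logistic, p_crf):
    # S_cg: logistic classifier confidence times CRF confidence.
    p_logistic = 1.0 / (1.0 + math.exp(-y_logistic))
    return p_logistic * p_crf

# Generic selection loop (Algorithm 1), illustrative only.
def select_for_annotation(train_data, pool, iterations, batch_size,
                          train_models, filter_fn, score_fn, annotate):
    for _ in range(iterations):
        models = train_models(train_data)                            # step 2
        scored = [(utt, [m(utt) for m in models]) for utt in pool]   # step 3
        kept = [(utt, ys) for utt, ys in scored if filter_fn(ys)]    # step 4
        kept.sort(key=lambda item: score_fn(item[1]))                # step 5
        batch = [utt for utt, _ in kept[:batch_size]]
        train_data = train_data + annotate(batch)                    # steps 6-7
        selected = set(batch)
        pool = [utt for utt in pool if utt not in selected]
    return train_data
```

With this structure, Majority-CRF corresponds to choosing `majority_filter` together with a CRF-based score, while the QBC variants swap in `disagreement_filter`.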
The full set of configurations we evaluated is given in Table 1, which specifies the choice of parameters {M_k}, F, S used in Algorithm 1. AL-Logistic and QBC serve as baseline AL algorithms. The QBC-CRF and Majority-CRF models combine the IC-focused binary classifier scores with the NER-focused sequence labeling scores and use filtering by disagreement and majority (respectively) to select informative utterances. To the best of our knowledge, this is a novel architecture for active learning in NLU.

Algorithm      | Models {M_i}     | Filter F                 | Scoring S
AL-Logistic    | lg               | sgn(y^lg) > 0            | y^lg
QBC-SA         | lg, sq, hg       | Σ_k sgn(y^k) ∈ {−1, 1}   | Σ_k |y^k|
QBC-AS         | lg, sq, hg       | Σ_k sgn(y^k) ∈ {−1, 1}   | |Σ_k y^k|
Majority-SA    | lg, sq, hg       | Σ_k sgn(y^k) > 0         | Σ_k |y^k|
Majority-AS    | lg, sq, hg       | Σ_k sgn(y^k) > 0         | |Σ_k y^k|
QBC-CRF        | lg, sq, hg, CRF  | Σ_k sgn(y^k) ∈ {−1, 1}   | p_lg × p_crf
Majority-CRF   | lg, sq, hg, CRF  | Σ_k sgn(y^k) > 0         | p_lg × p_crf

Table 1: AL algorithms evaluated. lg, sq, hg refer to binary classifiers (committee members) trained with the logistic, squared and hinge loss functions, respectively. y^i denotes the score of committee member i, p_crf denotes the confidence of the CRF model, and p_lg = (1 + e^(−y^lg))^(−1) denotes the confidence of the logistic classifier. In all cases, we prioritize by smallest score S.

Mamitsuka et al. (1998) proposed bagging to build classifier committees for AL. Bagging refers to random sampling with replacement of the original training data to create diverse classifiers. We experimented with bagging but found that it is not better than using different classifiers.

4 Experimental Results

4.1 Evaluation Metrics

We use Slot Error Rate (SER) (Makhoul et al., 1999), including the intent as a slot, to evaluate the overall predictive performance of the NLU models. SER is the ratio of the number of slot prediction errors to the total number of reference slots. Errors are insertions, substitutions and deletions. We treat intent misclassifications as substitution errors.

4.2 Simulated Active Learning

AL requires manual annotations, which are costly. Therefore, to conduct multiple controlled experiments with different selection algorithms, we simulated AL by taking a subset of the available annotated training data as the unannotated candidate pool and "hiding" the annotations. As such, the NLU system and AL algorithm had a small pool of annotated utterances for simulated "new" domains. Then, the AL algorithm was allowed to choose relevant utterances from the simulated candidate pool. Once an utterance is selected, its annotation is revealed to the AL algorithm, as well as to the full NLU system.

Dataset. We conducted experiments using an internal test dataset of 750K randomly sampled live utterances, and a training dataset of 42M utterances containing a combination of grammar generated and randomly sampled live utterances. The dataset covers 24 domains, including Music, Shopping, Local Search, Sports, Books, Cinema and Calendar.

NLU System. Our NLU system has one set of IC and NER models per domain. The IC model predicts one of its in-domain intents or a special out-of-domain intent which helps with domain classification. The IC and NER predictions are ranked into a single n-best list based on model confidences (Su et al., 2018). We use MaxEnt (Berger et al., 1996) models for IC and CRF models for NER (Lafferty et al., 2001).
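As a concrete illustration of the SER metric defined in Section 4.1, here is a minimal sketch. It is a simplification for illustration only: the hypothetical `slot_error_rate` function matches slots by name, represents the intent as just another slot so intent errors count as substitutions, and ignores alignment details a production scorer would handle.

```python
def slot_error_rate(reference, hypothesis):
    # Both arguments map slot name -> value, with the intent stored
    # under the key "intent".
    substitutions = sum(1 for name, value in reference.items()
                        if name in hypothesis and hypothesis[name] != value)
    deletions = sum(1 for name in reference if name not in hypothesis)
    insertions = sum(1 for name in hypothesis if name not in reference)
    return (substitutions + deletions + insertions) / len(reference)

# One substitution (Artist) and one insertion (Genre) over 2 reference slots.
reference = {"intent": "PlayMusicIntent", "Artist": "miles davis"}
hypothesis = {"intent": "PlayMusicIntent", "Artist": "miles", "Genre": "jazz"}
print(slot_error_rate(reference, hypothesis))  # (1 + 0 + 1) / 2 = 1.0
```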
Experimental Design. We split the training data into a 12M-utterance initial training set for IC and NER, and a 30M-utterance candidate pool for selection. We choose Books, Local Search, and Cinema as target domains to simulate the AL algorithms, see Table 2. Each target domain had 550-650K utterances in the candidate pool. The rest of the 21 non-target domains have 28.5M utterances in the candidate pool. We also added 100K sentence fragments and out-of-domain utterances to the candidate pool, which allows us to compare the susceptibility of different algorithms to noisy or irrelevant data. This experimental setup attempts to simulate the production IVA use case where the candidate pool has a large proportion of utterances that belong to different domains.

Domain       | Train | Test | Examples
Books        | 290K  | 13K  | "search in mystery books"; "read me a book"
Local Search | 260K  | 16K  | "mexican food nearby"; "pick the top bank"
Cinema       | 270K  | 9K   | "more about hulk"; "what's playing in theaters"

Table 2: Simulated "new" target domains for AL experiments. The target domain initial training datasets are 90% grammar generated data. The other 21 "non-new" domains have on average 550K utterances in their initial training datasets, with 60% grammar generated data and 40% live data.

We employed the different AL algorithms to select 12K utterances per domain from the candidate pool, for a total 36K utterance annotation budget. Also, we evaluated uniform (Rand-Uniform) and domain (Rand-Domain) random sampling with the same total budget. We ran each AL configuration twice and averaged the SER scores to account for fluctuations in selection caused by the stochasticity in model training. For random sampling, we ran each selection five times.

4.2.1 Simulated Active Learning Results

Table 3 shows the experimental results for the target domains Books, Local Search, and Cinema. For each experiment, we add all AL selected data (in- and out-of-domain), and evaluate SER for the full NLU system. We test for statistically significant improvements using the Wilcoxon test (Hollander et al., 2013) with 1000 bootstrap resamples and p-value < 0.05.

Random Baselines. As expected, Rand-Uniform selected few relevant utterances for the target domains due to their low frequency in the candidate pool. Rand-Domain selects relevant utterances for the target domains, achieving statistically significant SER improvements compared to Rand-Uniform. However, the overall gains are small, around 1% relative per target domain. A significant factor for Rand-Domain's limited improvement is that it tends to capture frequently-occurring utterances that the NLU models can already recognize without errors. As such, all AL configurations achieved statistically significant SER gains compared to the random baselines.

Single Model Algorithms. AL-Logistic, which carries out a single iteration of confidence-based selection, exhibits a statistically significant reduction in SER relative to Rand-Domain. Moreover, using six iterations (i.e., i=6) further reduced SER by a statistically significant 1%-2% relative to AL-Logistic(i=1), and resulted in the selection of 200 fewer unsupported utterances. This result demonstrates the importance of incremental selection for iteratively refining the selection model.

Committee Algorithms. AL algorithms incorporating a committee of models outperformed those based on single models by a statistically significant 1-2% ∆SER. The majority algorithms performed slightly better than the QBC algorithms and were able to collect more in-domain utterances. The absolute sum scoring function S^as performed slightly better than the sum of absolutes S^sa for both QBC and Majority. Amongst all committee algorithms, Majority-AS performed best, but the differences with the other committee algorithms are not statistically significant.

Committee and CRF Algorithms. AL algorithms incorporating a CRF model tended to outperform purely classification-based approaches, indicating the importance of specifically targeting the NER task. The Majority-CRF algorithm achieves a statistically significant SER improvement of 1-2% compared to Majority-AS (the best configuration without the CRF). Again, the disagreement-based QBC-CRF algorithm performed worse than the majority algorithm across target domains. This difference was statistically significant on Books, but not on Cinema and Local Search.

In summary, AL yields more rapid improvements not only by selecting utterances relevant to the target domain but also by trying to select the most informative utterances. For instance, although the AL algorithms selected 40-50% false positive utterances from non-target domains, whereas Rand-Domain selected only around 20% false positives, the AL algorithms still outperformed Rand-Domain. This indicates that labeling ambiguous false positives helps resolve existing confusions between domains.
Algorithm Group   | Algorithm (i = 6) | Overall #Utt | Books #Utt / ∆SER | Local Search #Utt / ∆SER | Cinema #Utt / ∆SER | Non-Target #Utt
Random            | Rand-Uniform      | 35.8K        | 747 / 1.20        | 672 / 3.37               | 547 / 0.57         | 33.8K
Random            | Rand-Domain       | 35.7K        | 9853 / 1.52       | 9453 / 4.23              | 9541 / 1.75        | 6.8K
Single Model      | AL-Logistic(i=1)  | 34.9K        | 5405 / 4.76       | 7092 / 6.54              | 5224 / 6.09        | 17.1K
Single Model      | AL-Logistic       | 35.1K        | 5524 / 6.77       | 7709 / 7.24              | 5330 / 7.29        | 16.5K
Committee Models  | QBC-AS            | 35.0K        | 4768 / 7.18       | 7869 / 8.57              | 4706 / 8.72        | 17.6K
Committee Models  | QBC-SA            | 35.0K        | 4705 / 7.12       | 7721 / 8.96              | 4790 / 7.52        | 17.7K
Committee Models  | Majority-AS       | 35.1K        | 5389 / 7.66       | 8013 / 9.07              | 5526 / 8.98        | 16.1K
Committee Models  | Majority-SA       | 35.1K        | 5267 / 7.35       | 8196 / 8.46              | 5193 / 8.42        | 16.4K
Committee and CRF | QBC-CRF           | 35.1K        | 3653 / 7.44       | 6593 / 9.78              | 4064 / 10.26       | 20.7K
Committee and CRF | Majority-CRF      | 35.1K        | 6541 / 8.42       | 8552 / 9.92              | 6951 / 11.05       | 13.0K

Table 3: Simulation experimental results with a 36K annotation budget. ∆SER is the % relative reduction in SER compared to the initial model: Books SER 30.59, Local Search SER 39.09, Cinema SER 38.71. Higher ∆SER is better. The i = 1 means selection in a single iteration; otherwise, if not specified, selection is in six iterations (i = 6). Overall #Utt shows the utterances remaining from the 36K selected after removing the sentence fragments and out-of-domain utterances. Both target and non-target domain IC and NER models are retrained with the new data.

Another important observation is that majority filtering F^maj performs better than QBC disagreement filtering F^dis across all of our experiments. A possible reason for this is that majority filtering selects a better balance of boundary utterances for classification and in-domain utterances for NER. Finally, the Majority-CRF results show that incorporating the CRF model improves the performance of the committee algorithms. We assume this is because incorporating a CRF-based confidence directly targets the NER task.

4.3 Human-in-the-loop Active Learning

We also performed AL for six new NLU domains with human-in-the-loop annotators and live user data. We used the Majority-SA configuration for simplicity in these case studies. We ran the AL selection for 5-10 iterations with varying batch sizes between 1000 and 2000.

Domain        | ∆SER | #Utt Selected | #Utt Testset
Recipes       | 8.97 | 24.1K         | 4.7K
LiveTV        | 6.92 | 11.6K         | 1.8K
OpeningHours  | 7.05 | 6.8K          | 583
Navigation    | 4.67 | 6.7K          | 6.4K
DropIn        | 9.00 | 5.3K          | 7.2K
Membership    | 7.13 | 4.2K          | 702

Table 4: AL with human annotator results. ∆SER is the % relative gain compared to the existing model. Higher is better.

Table 4 shows the results from AL with human annotators. On each feature, AL improved our existing NLU model by a statistically significant 4.6%-9%. On average, 25% of the utterances are false positives. This is lower than the 50% in the simulation because the initial training data exhibits more examples of the negative class. Around 10% of the AL selected data is lost due to being unactionable or out-of-domain, similar to the frequency with which these utterances are collected by random sampling.

While working with human annotators on new domains, we observed two challenges that impact the improvements from AL. First, annotators make more mistakes on AL selected utterances as they are more ambiguous. Second, new domains may have a limited amount of test data, so the impact of AL cannot be fully measured. Currently, we address the annotation mistakes with manual data clean up and transformations, but further research is needed to develop an automated solution. To improve the coverage of the test dataset for new domains, we are exploring test data selection using stratified sampling.

5 Conclusions

In this work, we focused on AL methods designed to select live data for manual annotation. The difference with prior work on AL is that we specifically target new domains in NLU. Our proposed Majority-CRF algorithm leads to statistically significant performance gains over standard AL and random sampling methods while working with a limited annotation budget. In simulations, our Majority-CRF algorithm showed an improvement of 6.6%-9% SER relative gain compared to random sampling, as well as improvements over other AL algorithms with the same annotation budget. Similarly, results with live annotators show statistically significant improvements of 4.6%-9% compared to the existing NLU system.
References

Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics.

Renato De Mori, Frédéric Bechet, Dilek Hakkani-Tur, Michael McTear, Giuseppe Riccardi, and Gokhan Tur. 2008. Spoken language understanding.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning.

Myles Hollander, Douglas A. Wolfe, and Eric Chicken. 2013. Nonparametric Statistical Methods. John Wiley & Sons.

Hong-Kwang Jeff Kuo and Vaibhava Goel. 2005. Active learning with minimum expected error for spoken language understanding. In INTERSPEECH.

John Lafferty, Andrew McCallum, Fernando Pereira, et al. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML.

John Langford, Lihong Li, and Alex Strehl. 2007. Vowpal Wabbit.

David D. Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning.

John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop.

Naoki Abe and Hiroshi Mamitsuka. 1998. Query learning strategies using boosting and bagging. In Machine Learning: Proceedings of the Fifteenth International Conference (ICML'98), volume 1. Morgan Kaufmann.

Thomas Osugi, Deng Kim, and Stephen Scott. 2005. Balancing exploration and exploitation: A new algorithm for active machine learning. In Data Mining, Fifth IEEE International Conference on.

Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through Monte Carlo estimation of error reduction. In ICML, Williamstown.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In International Symposium on Intelligent Data Analysis.

Hinrich Schütze, Emre Velipasaoglu, and Jan O. Pedersen. 2006. Performance thresholding in practical text classification.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report, University of Wisconsin–Madison.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Advances in Neural Information Processing Systems.

Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Chengwei Su, Rahul Gupta, Shankar Ananthakrishnan, and Spyros Matsoukas. 2018. A re-ranker scheme for integrating large scale NLU models. arXiv preprint arXiv:1809.09605.

Gokhan Tur, Dilek Hakkani-Tür, and Robert E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication.

Gokhan Tur, Robert E. Schapire, and Dilek Hakkani-Tur. 2003. Active learning for spoken language understanding. In Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03), 2003 IEEE International Conference on.