IntenT5: Search Result Diversification using Causal Language Models

Sean MacAvaney, Craig Macdonald, Roderick Murray-Smith, Iadh Ounis
{sean.macavaney,craig.macdonald,roderick.murray-smith,iadh.ounis}@glasgow.ac.uk
University of Glasgow, Glasgow, UK

arXiv:2108.04026v1 [cs.IR] 9 Aug 2021

ABSTRACT

Search result diversification is a beneficial approach to overcome under-specified queries, such as those that are ambiguous or multi-faceted. Existing approaches often rely on massive query logs and interaction data to generate a variety of possible query intents, which then can be used to re-rank documents. However, relying on user interaction data is problematic because one first needs a massive user base to build a sufficient log; public query logs are insufficient on their own. Given the recent success of causal language models (such as the Text-To-Text Transformer (T5) model) at text generation tasks, we explore the capacity of these models to generate potential query intents. We find that to encourage diversity in the generated queries, it is beneficial to adapt the model by including a new Distributional Causal Language Modeling (DCLM) objective during fine-tuning and a representation replacement during inference. Across six standard evaluation benchmarks, we find that our method (which we call IntenT5) improves search result diversity and attains (and sometimes exceeds) the diversity obtained when using query suggestions based on a proprietary query log. Our analysis shows that our approach is most effective for multi-faceted queries and is able to generalize effectively to queries that were unseen in training data.

Figure 1: Top results for the query "penguins" re-ranked using monoT5 without diversity measures and using our IntenT5 approach on the MS MARCO passage corpus. We manually removed near-duplicate passages from the results. Note that the IntenT5 model has a better variety of top-ranked passages related to both the (A)nimal and the (H)ockey team.

(a) monoT5 (no diversity)
1 (A) Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic, flightless birds living almost exclusively in the Southern Hemisphere, especially in Antarctica...
2 (A) Penguins are a group of aquatic, flightless birds. They live almost exclusively in the Southern Hemisphere, with only one species, the Galapagos penguin, found north of the equator...
3 (A) Penguins are flightless birds that are highly adapted for the marine environment. They are excellent swimmers, and can dive to great depths (emperor penguins can dive to over...
4 (A) Penguins are an iconic family of aquatic, flightless birds. Although we think of penguins as Antarctic birds some, like the Galapagos penguin, live in much warmer climates near...
5 (A) Penguins are torpedo-shaped, flightless birds that live in the southern regions of the Earth. Though many people imagine a small, black-and-white animal when they think of penguins...

(b) monoT5 (using IntenT5 + DCLM + RS + xQuAD)
1 (A) Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic, flightless birds living almost exclusively in the Southern Hemisphere, especially in Antarctica...
2 (H) Television coverage of the event will be provided by WPXI beginning at 12 p.m. ET and running through the entire event. You can also stream the parade live on the Penguins'...
3 (A) The most abundant species of penguin is the "Macaroni" with an approximate population of 20 to 25 million individuals, while there are only around 1,800 Galapagos penguins left...
4 (H) Marc-Andre Fleury will get the start in goal for Pittsburgh on Saturday. The Penguins are in Buffalo for the season finale and need a win to clinch a playoff spot. Fleury was in goal...
5 (H) It is the home of the Pittsburgh Penguins, who moved from their former home across the street prior to the 2010-11 NHL season. The CONSOL Energy Center has 18,387 seats for...

1 INTRODUCTION

Although contextualized language models (such as BERT [19] and T5 [42]) have been shown to be highly effective at adhoc ranking [30, 36, 37], they perform best with queries that give adequate context, such as natural-language questions [16]. Despite the rise of more expressive querying techniques (such as in conversational search systems [41]), keyword-based querying remains a popular choice for users.¹ However, keyword queries can often be under-specified, giving rise to multiple possible interpretations or intents [8]. Unlike prior lexical models, which do not account for word senses or usage in context, contextualized language models are prone to scoring based on a single predominant sense, which can hinder search result quality for under-specified queries. For instance, the results in Figure 1(a) are all similar and do not cover a variety of information needs.

Under-specified queries can be considered ambiguous and/or faceted [12]. For ambiguous queries, intents are distinct and often correspond to different word senses. For example, the query "penguins" may refer to either the animal or the American ice hockey team (among other senses). In Figure 1, we see that the monoT5 model [37] only identifies passages for the former sense in the top results. In fact, the first occurrence of a document about the hockey team is ranked at position 158, meaning that users with this query intent would likely need to reformulate their query to satisfy their information need.

¹ https://trends.google.com
In a faceted query, a user may be interested in different aspects of a given topic. In the example of "penguins", a user may be looking for information about the animal's appearance, habitat, life cycle, etc., or the hockey team's schedule, roster, score, etc. Here again, the monoT5 results lack diversity in terms of facets, with the top results all focusing on the habitat and appearance.

Search result diversification approaches aim to overcome this issue. In this setting, multiple potential query intents are predicted and the relevance scores for each intent are combined to provide diversity among the top results (e.g., using algorithms like IA-Select [1], xQuAD [43] or PM2 [18]). Intents can be inferred from manually-constructed hierarchies [1] or from interaction data [43], such as popular searches or reformulations. Although using interaction data is possible for large, established search engines, it is not a feasible approach for search engines that do not have massive user bases, nor for academic researchers, as query logs are proprietary data. Researchers have instead largely relied on search result suggestions from major search engines, which are black-box algorithms, or on the "gold" intents used for diversity evaluation [18]. Thus, an effective approach for generating potential query intents without needing a massive amount of interaction data is desirable.

Given the recent success of Causal Language Models (CLMs) such as T5 [42] in a variety of text generation tasks, we propose using these models to generate potential intents for under-specified queries. Figure 2 provides an overview of our approach (IntenT5). We fine-tune the model on a moderately-sized collection of queries (ORCAS [15]), and evaluate using 6 TREC diversity benchmark datasets. We find that our approach improves the search result diversity of both lexical and neural re-ranking models, and can even exceed the diversity performance obtained when using Google query suggestions and the gold TREC intent descriptions. We also find that our approach has the biggest gains on queries that occur infrequently (or never) in the collection of training queries, showing that the approach is able to generalize effectively to unseen queries.

Figure 2: Overview of our search result diversification system using IntenT5 to generate potential query intents. A user query (e.g., "penguins") is used for an initial retrieval (e.g., DPH, BM25); IntenT5 generates potential query intents (e.g., "schedule", "population", ..., "tickets"); the candidate documents are scored for each intent (e.g., with monoT5 or ColBERT); and the scores are aggregated (e.g., with xQuAD or PM2) into the final search result list.

Through analysis of our proposed IntenT5 model, we find that it has difficulty improving over plain adhoc ranking models for ambiguous queries. Indeed, we find this to be challenging for other approaches as well. In an attempt to better handle ambiguity, we explore two novel techniques for improving the variety of generated intents. First, we propose a Distributional Causal Language Modeling (DCLM) objective. This approach targets the observation that a typical CLM trained for this task tends to over-predict general terms that are not specific to the query (e.g., 'information', 'meaning', 'history') since these are highly-represented in the training data. DCLM simultaneously optimizes the model to generate all subsequent tokens that can follow a prefix, rather than just a single term, which should help the model better learn the variety of senses that terms can exhibit. We also introduce a clustering-based Representation Swapping (RS) approach that replaces the internal term representations with a variety of possible alternate senses. Qualitatively, we find that these approaches can help improve the diversity of ambiguous queries in isolated cases. For instance, in Figure 1(b), multiple senses are identified and accounted for. However, in aggregate, we found insufficient evidence that they improve an unmodified IntenT5 model. Nevertheless, our study opens the door for more research in this area and motivates the creation of larger test sets with ambiguous queries.

In summary, our contributions are:
• We propose using causal language models for predicting query intents for search result diversification.
• Across 6 TREC diversity benchmarks, we show that this approach can outperform query intents generated from massive amounts of interaction data, and that the model effectively generalizes to previously unseen queries.
• We introduce a new distributional causal language modeling objective and a representation replacement strategy to better handle ambiguous queries.
• We provide an analysis that investigates the situations where IntenT5 is effective, and qualitatively assess the generated intents.

2 BACKGROUND AND RELATED WORK

In this section, we cover background and prior work related to search result diversification (Section 2.1), causal language modeling (Section 2.2), and neural ranking (Section 2.3).

2.1 Search Result Diversification

Search result diversification techniques aim to handle ambiguous queries. Early works aimed to ensure that the retrieved documents addressed distinct topics; for instance, Maximal Marginal Relevance (MMR) [4] can be used to promote documents that are relevant to the user's query, but are dissimilar to the documents already retrieved. In doing so, the conventional document independence assumption inherent to the Probability Ranking Principle is relaxed. Indeed, by diversifying the topics covered in the top-ranked documents, diversification approaches aim to address the risk that there are no relevant documents retrieved for the user's information need [48]. Other approaches such as IA-Select used category hierarchies to identify documents with different intents [1].

Given an under-specified (ambiguous or multi-faceted) query q and a candidate set of documents D, the potential query intents {q1, q2, ...} can be identified as sub-queries, i.e., query formulations that more clearly identify relevant documents about a particular interpretation of the query. These intents are usually identified from interaction data, such as query logs [43]. In academic research, query suggestions from major search engines often serve as a stand-in for this process [18, 43]. The candidate documents are re-scored for each of the intents. The scores from the individual intents are then aggregated into a final re-ranking of documents, using an algorithm such as xQuAD or PM2.

Aggregation strategies typically attempt to balance relevance and novelty. xQuAD [43] iteratively selects documents that exhibit high relevance to the original query and are maximally relevant to the set of intents. As documents are selected, the relevance scores of documents to intents already shown are marginalized. The balance between relevance to the original query and the intents is controlled with a parameter λ. PM2 [18] is an aggregation strategy based on a proportional representation voting scheme. This aggregation strategy ignores relevance to the original query, and iteratively selects the intent least represented so far. The impact of the selected intent versus the intents that are not selected when choosing the next document is controlled with a parameter λ (not to be confused with xQuAD's λ).
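For illustration, the greedy xQuAD selection can be sketched as follows (a minimal Python sketch, not the exact implementation used in our experiments; rel, rel_to_intent, and intent_prob stand in for probability-scaled scores supplied by the surrounding pipeline):

    def xquad(candidates, intents, rel, rel_to_intent, intent_prob, lam, k):
        """Greedy xQuAD re-ranking: trade off relevance to the original
        query against coverage of the predicted intents."""
        # remaining[i]: how much of intent i is still uncovered, i.e. the
        # product of (1 - P(d'|i)) over the documents d' selected so far.
        remaining = {i: 1.0 for i in intents}
        selected, pool = [], list(candidates)
        while pool and len(selected) < k:
            def gain(d):
                diversity = sum(intent_prob[i] * rel_to_intent[d, i] * remaining[i]
                                for i in intents)
                return (1 - lam) * rel[d] + lam * diversity
            best = max(pool, key=gain)
            pool.remove(best)
            selected.append(best)
            for i in intents:  # discount intents the chosen document covers
                remaining[i] *= 1.0 - rel_to_intent[best, i]
        return selected

Setting lam to 0 reduces the procedure to ranking by relevance alone, while larger values increasingly favor documents covering intents not yet represented.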
More recently, there has been a family of work addressing learned models for diversification [20, 49, 51]; we see this as orthogonal to our work here, since we do not consider learned diversification approaches. Indeed, in this work, we study the process of query intent generation for search result diversification, rather than aggregation strategies. For further information about search result diversification, see [44].

2.2 Causal Language Modeling

Causal Language Models (CLMs) predict the probability of a token given the prior tokens in the sequence: P(t_i | t_{i-1}, t_{i-2}, ..., t_1). This property makes a CLM able to generate text: by providing a prompt, the model can iteratively predict a likely sequence of following tokens. However, a complete search of the space is exponential because the probability of each token depends on the preceding generated tokens. Various strategies exist for pruning this space. A popular approach for reducing the search space is a beam search, where a fixed number of high-probability sequences are explored in parallel. Alternative formulations, such as Diverse Beam Search [46], have been proposed, but we found these techniques unnecessary for short texts like queries. We refer the reader to Meister et al. [32] for further details about beam search and text generation strategies.

While CLMs were previously realized with recurrent neural networks [24, 33], this modeling has recently been accomplished through transformer networks [17]. Networks pre-trained with a causal language modeling objective, for instance T5 [42], can be an effective starting point for further task-specific training [42]. In the case of T5, specific tasks are also cast as sequence generation problems by encoding the source text and generating the model prediction (e.g., a label for classification tasks).

In this work, we explore the capacity of CLMs (T5 in particular) for generating a diverse set of possible query intents. This differs from common uses of CLMs due to the short nature of the text (keyword queries rather than natural-language sentences or paragraphs) and the focus on the diversity of the predictions.
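As an illustration of the beam search decoding discussed above, a minimal sketch using the HuggingFace transformers library is shown below (one possible toolkit; the stock t5-base weights stand in for a fine-tuned checkpoint such as the one described in Section 3):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # Explore 20 high-probability continuations of an under-specified
    # query in parallel, keeping every finished beam.
    inputs = tokenizer("penguins", return_tensors="pt")
    beams = model.generate(
        **inputs,
        num_beams=20,
        num_return_sequences=20,
        max_new_tokens=10,  # short, query-like continuations
    )
    for seq in beams:
        print(tokenizer.decode(seq, skip_special_tokens=True))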
2.3 Neural Ranking

Neural approaches have been shown to be effective for adhoc ranking tasks, especially when the intents are expressed clearly and in natural language [16]. Pre-trained contextualized language models, such as BERT [19] and ELECTRA [6], have been particularly effective for adhoc ranking [30, 36]. A simple application of these models is the "vanilla" setting (also called CLS, mono, and cross), where the query and document text is jointly encoded and the model's classification component is tuned to provide a ranking score. Due to the expense of such approaches, neural models are typically used as re-rankers; that is, an initial ranking model such as BM25 is used to provide a pool of documents that can be re-ranked by the neural method. Neural approaches have also been applied as first-stage rankers [25, 50, 52] (also called dense retrieval). The ColBERT model [25] scores documents based on BERT-based query and document term representations. The similarities between the query and document representations are summed to give a ranking score. This model can be used both as a first-stage dense ranker (through an approximate search over its representations) and as a re-ranker (to produce precise ranking scores). We focus on re-ranking approaches in this work, which is in line with prior work on diversification [44].

CLMs have been used in neural ranking tasks. Nogueira et al. [37] predicted the relevance of a document to a query using the T5 model (monoT5). Pradeep et al. [40] further explored this model and showed that it can be used in conjunction with a version that scores and aggregates pairs of documents (duoT5). Both these models only use CLMs insofar as predicting a single token ('true' or 'false' for relevance) given the text of the query and document and a prompt. Here, the probability of 'true' is used as the ranking score. Doc2query models [38, 39] generate possible queries, conditioned on document text, to include in an inverted index.

Unlike prior neural ranking efforts, we focus on diversity ranking, rather than adhoc ranking. Specifically, we use a neural CLM to generate possible query intents given a query text, which differs from prior uses of CLMs in neural ranking. These query intents are then used to score and re-rank documents. To our knowledge, this is the first usage of neural ranking models for diversity ranking. For further details on neural ranking and re-ranking models, see [27, 34].

3 GENERATING QUERY INTENTS

In this section, we describe our proposed IntenT5 model for query intent generation (Section 3.1), as well as two model adaptations intended to improve the handling of ambiguous queries: distributional causal language modeling (Section 3.2) and representation swapping (Section 3.3).

3.1 IntenT5

Recall that we seek to train a model that can be used to generate potential query intents. We formulate this task as a sequence-to-sequence generation problem by predicting additional terms for the user's initial (under-specified) query.

We first fine-tune a T5 [42] model using a causal language modeling objective over a collection of queries. Note that this training approach does not require search frequency, session, or click information; it only requires a collection of query text. This makes a variety of data sources available for training, such as the ORCAS [15] query collection. This is a desirable quality because releasing query text poses fewer risks to personal privacy than more extensive interaction information.

Recall that for a sequence consisting of n tokens, causal language models optimize for P(t_i | t_{i-1}, t_{i-2}, ..., t_1). To generate intents, we use a beam search to identify highly-probable sequences. No length penalization is applied, but queries are limited to 10 generated tokens.² We apply basic filtering techniques to remove generated intents that do not provide adequate additional context. In particular, we first remove terms that appear in the original query. Since neural retrieval models are sensitive to word morphology [29], we only consider exact term matches for this filter. We also discard intents that are very short (less than 6 characters, e.g., ".com"), as we found that these usually carry little valuable context. Among the filtered intents, we select the top k most probable sequences. Note that this generation process is fully deterministic, so the results are entirely replicable.

Each retrieved document is scored for each of the generated intents, and the intent scores are aggregated using an established diversification algorithm, like xQuAD [43] or PM2 [18].

Instead of T5, our approach could also be applied to other pre-trained causal language models, such as BART [26]; however, we leave such a study for future work.

² We found that this covers the vast majority of cases for this task, in practice.
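A minimal sketch of the filtering rules above (variable names are ours; generated is assumed to be ordered from most to least probable):

    def filter_intents(query, generated, k, min_len=6):
        """Drop exact repeats of query terms and very short strings,
        then keep the top-k most probable remaining intents."""
        query_terms = set(query.lower().split())
        kept = []
        for intent in generated:
            # Only exact term matches are removed, since neural retrieval
            # models are sensitive to word morphology.
            terms = [t for t in intent.lower().split() if t not in query_terms]
            cleaned = " ".join(terms)
            if len(cleaned) < min_len:  # e.g., ".com" carries little context
                continue
            kept.append(cleaned)
            if len(kept) == k:
                break
        return kept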
3.2 Distributional Causal Language Modeling

Typical natural language prose, such as the type of text found in documents, lends itself well to CLM because the text quickly diverges into a multitude of meanings. For instance, in Figure 3, we see that the prefix "Penguins are" diverges into a variety of sequences (e.g., "a group of", "birds, not", "carnivores with", etc.). If structured by prefixes, this results in long chains of tokens. Keyword queries, on the other hand, typically have a hierarchical prefix-based nature. When structured as a tree, the result tends to be shallow and dense. For instance, in Figure 3, a distribution of terms is likely to follow the prefix 'penguins' (e.g., adaptations, animals, hockey, etc.). Similarly, a distribution of terms follows the prefix 'penguins hockey' (e.g., game, live, news, score, etc.).

Figure 3: Comparison of document language and query language. Documents (MS MARCO passages) quickly diverge into long chains of prose (e.g., "Penguins are a group of aquatic, flightless birds. They live...", "Penguins are birds, not mammals...", "Penguins are carnivores with piscivorous diets...", "Penguins are especially abundant on islands in colder climates..."), whereas keyword queries (ORCAS) share prefixes (e.g., "penguins adaptations", "penguins animals", "penguins animals facts", "penguins hockey", "penguins hockey game", "penguins hockey game tonight", "penguins hockey live streaming"). Queries naturally lend themselves to a tree structure, motivating our DCLM approach.

Based on this observation, we propose a new variant of Causal Language Modeling (CLM) designed for keyword queries: Distributional Causal Language Modeling (DCLM). In contrast with CLM, DCLM considers other texts in the source collection when building the learning objectives through the construction of a prefix tree. In other words, while CLM considers each sequence independently, DCLM builds a distribution of terms that follow a given prefix. Visually, the difference between the approaches is shown in Figure 4. When training, the prefix tree is used to find all subsequent tokens across the collection given a prefix, and the output of the model is optimized to generate all of these tokens (with equal probability). The training process for DCLM is given in Algorithm 1.

Algorithm 1 DCLM Training Procedure
    tree ← BuildPrefixTree(corpus)
    repeat
        prefix ← {RandomSelect(tree.children)}
        targets ← prefix.children
        while |prefix[−1].children| > 0 do
            Optimize P(prefix[−1].children | prefix)
            prefix ← {prefix, RandomSelect(prefix[−1].children)}
        end while
    until converged

Figure 4: Graphical distinction between Causal Language Modeling (CLM, (a)) and our proposed Distributional Causal Language Modeling (DCLM, (b)). CLM treats each query (e.g., "penguins adaptation", "penguins animal", "penguins animal facts", ..., "penguins wikipedia") as an independent sequence, whereas given a prompt (e.g., 'penguins'), a DCLM objective optimizes for all possible subsequent tokens (e.g., adaptation, animal, antarctica, etc.) rather than implicitly learning this distribution over numerous training samples.
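To make the prefix-tree construction behind Algorithm 1 concrete, here is a small sketch (shown at the term level for readability; the model itself operates on subword tokens). Each yielded pair would be converted into a uniform target distribution over the observed next terms for the fine-tuning loss.

    import random
    from collections import defaultdict

    def build_prefix_tree(queries):
        """Map each term-level prefix to the set of terms observed after it."""
        children = defaultdict(set)
        for q in queries:
            terms = q.split()
            for i in range(len(terms)):
                children[tuple(terms[:i])].add(terms[i])
        return children

    def dclm_walk(children, rng=random):
        """One random root-to-leaf walk, mirroring Algorithm 1: at each step
        the model is optimized to produce *all* observed next terms with
        equal probability, then one branch is followed at random."""
        prefix = [rng.choice(sorted(children[()]))]
        while children[tuple(prefix)]:
            targets = sorted(children[tuple(prefix)])
            yield " ".join(prefix), targets  # (input prefix, target terms)
            prefix.append(rng.choice(targets))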
3.3 Representation Swapping

In a transformer model – which is the underlying neural architecture of T5 – tokens are represented as contextualized vector representations. These representations map tokens to a particular sense; this is exploited by several neural ranking models (like CEDR [30], TK [22], and ColBERT [25]) to match particular word senses. Normally, the surrounding words in a piece of text offer adequate context to disambiguate word senses. However, short queries inherently lack such context. For instance, in the case where a query contains only a single term, we find that transformer models simply choose a predominant sense (e.g., the animal sense for the query penguins). When used with the IntenT5 model, we find that this causes the generated intents to lack diversity of word senses (we demonstrate this in Section 6). We introduce an approach we call Representation Swapping (RS) to overcome this issue.

RS starts by building a set of prototype representations for each term in a corpus.³ For a given term, a random sample of passages from a corpus that contain the term are selected. Then, the term's internal representations are extracted from each passage.⁴ Note that, because of the context from the other terms in the passage, these representations include the sense information. All of the representations for a given term are then clustered into c clusters. A single prototype representation for each cluster is selected by finding the representation that is closest to the median value across the representations in the cluster. Using this approach, we find that the sentences from which the prototypes were selected often express different word senses.

Finally, when processing a query, the IntenT5 model is executed multiple times: once with the original representation (obtained from encoding the query alone), and then additional times for each term. In these instances, the internal representations of a given term are swapped with the prototype representations. This allows the T5 model to essentially inherit the context from the prototype sentence for the ambiguous query, and allows the model to generate text based on different senses. This approach only needs to be applied for shorter queries, since longer queries provide enough context on their own. This introduces a parameter l, the maximum query length at which RS is applied. The final intents generated by the model are selected using a diversity aggregation algorithm like xQuAD, which ensures a good mix of possible senses in the generated intents.

³ In practice, terms are filtered to only those meeting a frequency threshold, as infrequent terms are less prone to being ambiguous. In an offline experimentation setting, only the terms that appear in the test queries need to be considered.
⁴ Note that terms are often broken down into subwords by T5's tokenizer. In this case, we concatenate the representations of each of the constituent subwords. Although the size of the concatenated representations across terms may differ, the representations for a single term are the same length.
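A sketch of the prototype selection described above, using scikit-learn's agglomerative clustering (one reasonable choice; the exact clustering configuration is not prescribed here). term_reps is assumed to hold one contextualized vector per sampled passage for a single term.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def prototype_representations(term_reps, n_clusters=5):
        """Cluster a term's contextualized representations and return, per
        cluster, the member closest to the cluster's element-wise median."""
        reps = np.stack(term_reps)  # shape: (num_passages, dim)
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reps)
        prototypes = []
        for c in range(n_clusters):
            members = reps[labels == c]
            median = np.median(members, axis=0)
            nearest = np.argmin(np.linalg.norm(members - median, axis=1))
            prototypes.append(members[nearest])
        return prototypes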
Table 1: Dataset statistics.

Dataset    # Topics   # Terms per topic   # Subtopics (per topic)   # Pos. Qrels (per topic)
WT09 [8]   50         2.1                 243 (4.9)                 6,499 (130.0)
WT10 [9]   50         2.1                 218 (4.4)                 9,006 (180.1)
WT11 [10]  50         3.4                 168 (3.4)                 8,378 (167.6)
WT12 [11]  50         2.3                 195 (3.9)                 9,368 (187.4)
WT13 [13]  50         3.3                 134 (2.7)                 9,121 (182.4)
WT14 [14]  50         3.3                 132 (2.6)                 10,629 (212.6)
Total      300        2.8                 1,090 (3.6)               53,001 (176.7)

4 EXPERIMENTAL SETUP

We experiment to answer the following research questions:
RQ1: Can the intents generated from IntenT5 be used to diversify search results?
RQ2: Do queries that appeared in the IntenT5 training data perform better than those that did not?
RQ3: Does IntenT5 perform better at ambiguous or multi-faceted queries?
RQ4: Does training with a distributional causal language modeling objective or performing representation swapping improve the quality of the generated intents?

4.1 IntenT5 Training and Settings

Although IntenT5 can be trained on any moderately large collection of queries, we train the model using the queries from the ORCAS [15] dataset. With 10.4M unique queries, this dataset is both moderately large⁵ and easily accessible to researchers without signed data usage agreements. The queries in ORCAS were harvested from the Bing logs, filtered down to only those where users clicked on a document found in the MS MARCO [3] document dataset. Queries in this collection contain an average of 3.3 terms, with the majority of queries containing either 2 or 3 terms. The IntenT5 model is fine-tuned from t5-base using default parameters (learning rate: 5 × 10⁻⁵, 3 epochs, Adam optimizer).

When applying RS, we use c = 5, l = 1, and xQuAD with λ = 1, and use agglomerative clustering, based on qualitative observations during pilot studies. We select 1,000 passages per term from the MS MARCO document corpus, so as not to potentially bias our results toward our test corpora.

⁵ It is widely known that Google processes over 3B queries daily, with roughly 15% being completely unique.
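The exact input/output encoding used during fine-tuning is not spelled out here; one plausible formulation, sketched below, splits each training query into a prefix (encoder input) and its continuation (decoder target).

    import random

    def clm_pairs(queries, rng=random):
        """Yield (prefix, continuation) pairs for sequence-to-sequence
        fine-tuning with a causal LM-style objective (an assumed encoding,
        not necessarily the one used for IntenT5)."""
        for q in queries:
            terms = q.split()
            if len(terms) < 2:
                continue
            cut = rng.randrange(1, len(terms))  # random split point
            yield " ".join(terms[:cut]), " ".join(terms[cut:])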
4.2 Evaluation

We evaluate the effectiveness of our approach on the TREC Web Track (WT) 2009–14 diversity benchmarks [8–11, 13, 14], consisting of 300 topics and 1,090 sub-topics in total. Table 1 provides a summary of these datasets. These benchmarks span two corpora: WT09–12 use ClueWeb09-B (50M documents) and WT13–14 use ClueWeb12-B13 (52M documents). We use the keyword-based "title" queries, simulating a setting where the information need is under-specified. To the best of our knowledge, these are the largest and most extensive public benchmarks for evaluating search result diversification.

We measure system performance with three diversification-aware variants of standard evaluation measures: α-nDCG@20 [7], ERR-IA@20 [5], and NRBP [12]. These are the official task evaluation metrics for WT10–14 (WT09 used α-nDCG@20 and P-IA@20). α-nDCG is a variant of nDCG [23] that accounts for the novelty of the topics introduced. We use the default α parameter (the probability of an incorrect positive judgment) of 0.5 for this measure. ERR-IA is a simple mean over the Expected Reciprocal Rank of each intent, since WT09–14 weight the gold query intents uniformly. NRBP (Novelty- and Rank-Biased Precision) is an extension of RBP [35], measuring the average utility gained as a user scans the search results. Metrics were calculated with the official task evaluation script ndeval.⁶ Furthermore, as these test collections were created before the advent of neural ranking models, we also report the judgment rate among the top 20 results (Judged@20) to ascertain the completeness of the relevance assessment pool in the presence of such neural models.

To test the significance of differences, we use paired t-tests with p < 0.05, accounting for multiple tests where appropriate with a Bonferroni correction. In some cases, we test for significant equivalences (i.e., that the means are the same). For these tests, we use a two one-sided test (TOST [47]) with p < 0.05. Following prior work using a TOST for retrieval effectiveness [31], we set the acceptable equivalence range to ±0.01.

4.3 Baselines

To put the performance of our method in context, we include several adhoc and diversity baselines. As adhoc baselines, we compare with DPH [2], a lexical model (we found it to perform better than BM25 in pilot studies), Vanilla BERT [30], monoT5 [37], and a ColBERT [25] re-ranker.⁷ Since the neural models we use have a maximum sequence length, we apply the MaxPassage [16] scoring approach. Passages are constructed using sliding windows of 150 tokens (stride 75). For Vanilla BERT, we trained a model on MS MARCO using the original authors' released code. For monoT5 and ColBERT, we use versions released by the original authors that were trained on the MS MARCO dataset [3]. This type of zero-shot transfer from MS MARCO to other datasets has generally been shown to be effective [28, 37], and reduces the risk of over-fitting to the test collection.

⁶ https://trec.nist.gov/data/web/10/ndeval.c
⁷ We only use ColBERT in a re-ranking setting due to the challenge of scaling its space-intensive dense retrieval indices to the very large ClueWeb datasets.
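The sliding-window construction and MaxPassage aggregation can be sketched as follows (our illustration; score_fn stands in for any of the neural rankers above):

    def sliding_passages(doc_tokens, window=150, stride=75):
        """Split a tokenized document into overlapping passages so that it
        fits within the rankers' maximum sequence length."""
        last_start = max(len(doc_tokens) - window, 0)
        return [doc_tokens[s:s + window] for s in range(0, last_start + 1, stride)]

    def max_passage_score(doc_tokens, score_fn):
        """MaxPassage: a document scores as highly as its best passage."""
        return max(score_fn(p) for p in sliding_passages(doc_tokens))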
Google Suggestions. We compare with the search suggestions provided by the Google search engine through their public API. Though the precise details of the system are private, public information states that interaction data plays a big role in the generation of their search suggestions [45], meaning that this is a strong baseline for an approach based on a query log. Furthermore, this technique was used extensively in prior search result diversification work as a source of query intents [18, 43]. Note that the suggestions are sensitive to language, geographic location and current trends; we use the suggestions in English for the United States (since the TREC assessors were based in the US); we will release a copy of these suggestions for reproducibility.

Gold Intents. We also compare with systems that use the "Gold" intents provided by the TREC task. Note that this is not a realistic system, as these intents represent the evaluation criteria and are not known a priori. Furthermore, the text of these intents is provided in natural language (similar to TREC description queries), unlike the keyword-based queries they elaborate upon. Hence, these Gold intents are often reported to represent a potential upper bound on diversification effectiveness; however, later we will show that intents generated by IntenT5 can actually outperform these Gold intents.⁸

4.4 Model Variants and Parameter Tuning

We aggregate the intents from IntenT5, Google suggestions, and the Gold intents using xQuAD [43] and PM2 [18], representing two strong unsupervised aggregation techniques.⁹ For all these models, we tune the number of generated intents k and the aggregation parameter λ using a grid search over the remaining collections (e.g., WT09 parameters are tuned on WT10–14). We search between 1–20 intents (step 1) and λ between 0–1 (step 0.1). Neural models re-rank the top 100 DPH results. In summary, an initial pool of 100 documents is retrieved using DPH. Intents are then chosen using IntenT5 (or using the baseline methods). The documents are then re-scored for each intent using DPH, Vanilla BERT, monoT5, or ColBERT. The scores are then aggregated using xQuAD or PM2.

⁸ For instance, this may be caused by an intent identified by an IntenT5 model resulting in a more effective ranking than the corresponding Gold intent.
⁹ Although "learning-to-diversify" approaches, such as M2Div [21], can be effective, our work focuses on explicit diversification. The explicit setting is advantageous because it allows for a greater degree of model interpretability and transparency with the user.
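A sketch of the grid search (the evaluate callback, which scores a (k, λ) setting on the held-out topics, is a hypothetical stand-in for the evaluation pipeline):

    import itertools

    def tune(evaluate):
        """Grid search over the number of intents k (1..20) and the
        aggregation parameter lambda (0.0..1.0, step 0.1), returning the
        setting that maximizes effectiveness on the held-out collections
        (e.g., WT10-14 when tuning for WT09)."""
        grid = itertools.product(range(1, 21), [x / 10 for x in range(11)])
        return max(grid, key=lambda setting: evaluate(*setting))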
Table 2: Results over the combined TREC WebTrack 2009–14 diversity benchmark datasets. α-nDCG, ERR-IA, and Judged are computed with a rank cutoff of 20. PM and xQ indicate the PM2 and xQuAD aggregators, respectively. Statistically significant differences between each value and the corresponding non-diversified baseline are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

System          Agg.   α-nDCG    ERR-IA    NRBP      Judged
DPH                    0.3969    0.3078    0.2690    62%
+ IntenT5       PM     *0.4213   *0.3400   *0.3053   56%
+ Google Sug.   PM     *0.4232   *0.3360   *0.3000   58%
+ Gold          PM     *0.4566   *0.3629   *0.3254   53%
+ IntenT5       xQ     0.4049    0.3213    0.2845    58%
+ Google Sug.   xQ     *0.4203   *0.3326   0.2963    59%
+ Gold          xQ     *0.4545   *0.3616   *0.3230   56%
Vanilla BERT           0.3790    0.2824    0.2399    54%
+ IntenT5       PM     *0.4364   *0.3475   *0.3104   60%
+ Google Sug.   PM     *0.4110   *0.3119   0.2689    57%
+ Gold          PM     *0.4214   *0.3214   *0.2803   50%
+ IntenT5       xQ     *0.4328   *0.3448   *0.3085   59%
+ Google Sug.   xQ     *0.4120   *0.3140   *0.2722   59%
+ Gold          xQ     *0.4228   *0.3236   *0.2813   55%
monoT5                 0.4271    0.3342    0.2943    58%
+ IntenT5       PM     *0.4510   0.3589    0.3213    58%
+ Google Sug.   PM     *0.4506   0.3567    0.3181    58%
+ Gold          PM     *0.4722   *0.3777   *0.3409   58%
+ IntenT5       xQ     0.4444    0.3549    0.3183    58%
+ Google Sug.   xQ     *0.4492   *0.3625   *0.3276   58%
+ Gold          xQ     *0.4574   *0.3696   *0.3353   58%
ColBERT                0.4271    0.3334    0.2953    57%
+ IntenT5       PM     *0.4711   *0.3914   *0.3616   58%
+ Google Sug.   PM     *0.4561   0.3654    0.3316    57%
+ Gold          PM     0.4548    0.3520    0.3106    51%
+ IntenT5       xQ     *0.4707   *0.3890   *0.3584   63%
+ Google Sug.   xQ     *0.4552   *0.3638   *0.3287   61%
+ Gold          xQ     *0.4560   0.3608    0.3226    56%

5 RESULTS

In this section, we provide results for research questions concerning the overall effectiveness of IntenT5 (Section 5.1), the impact of queries appearing in the training data (Section 5.2), the types of under-specification (Section 5.3), and finally the impact on ambiguous queries in particular (Section 5.4). Later, in Section 6, we provide a qualitative analysis of the generated queries.

5.1 RQ1: IntenT5 Effectiveness

We present the diversification results for WT09–14 in Table 2. We generally find that our IntenT5 approach can improve the search result diversity of both lexical and neural models, when aggregating using either PM2 or xQuAD. In fact, there is only one setting (monoT5 scoring with xQuAD aggregation) where diversity is not significantly improved when using IntenT5. The overall best-performing result uses IntenT5 (ColBERT scoring with PM2 aggregation). These results also significantly outperform the corresponding versions that use Google suggestions and the Gold intents. Similarly, when using Vanilla BERT, IntenT5 also significantly outperforms the model using Google suggestions. For DPH and monoT5, the diversity effectiveness of IntenT5 is similar to that of the Google suggestions; the differences are not statistically significant. However, through equivalence testing using a TOST, we find that there is insufficient evidence that the means are equivalent across all evaluation metrics. Curiously, both BERT-based models (Vanilla BERT and ColBERT) are more receptive to the IntenT5 queries than to Google suggestions or Gold intents. This suggests that some underlying language models (here, BERT) may benefit more from artificially-generated intents than others.

These results provide a clear answer to RQ1: the intents generated from IntenT5 can be used to significantly improve the diversity of search results. Further, they can also, surprisingly, outperform Google suggestions and Gold intents.
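The TOST equivalence test applied above can be sketched with statsmodels (per-topic scores for the two systems are assumed to be paired arrays):

    import numpy as np
    from statsmodels.stats.weightstats import ttost_paired

    def equivalent(scores_a, scores_b, margin=0.01, alpha=0.05):
        """Two one-sided tests (TOST): is the mean per-topic difference
        between systems A and B within +/- margin? A small overall
        p-value supports equivalence."""
        pvalue, _, _ = ttost_paired(np.asarray(scores_a), np.asarray(scores_b),
                                    low=-margin, upp=margin)
        return pvalue < alpha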
Table 3: Diversification performance (α-nDCG@20) stratified by the frequency of the query text in ORCAS. Ranges were selected to best approximate 3 even buckets. Statistically significant differences from the unmodified system are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

System          Agg.   0–1       2–37      38+
DPH                    0.4930    0.3876    0.3048
+ IntenT5       PM     0.5217    *0.4283   0.3091
+ Google Sug.   PM     0.5026    *0.4435   0.3204
+ Gold          PM     *0.5311   *0.4648   *0.3704
+ IntenT5       xQ     0.4970    *0.4177   0.2960
+ Google Sug.   xQ     0.5004    *0.4379   0.3192
+ Gold          xQ     *0.5259   *0.4708   *0.3640
Vanilla BERT           0.4467    0.3756    0.3112
+ IntenT5       PM     *0.5043   *0.4403   0.3615
+ Google Sug.   PM     0.4634    0.4186    0.3487
+ Gold          PM     *0.4831   *0.4290   0.3492
+ IntenT5       xQ     *0.5041   *0.4308   0.3598
+ Google Sug.   xQ     *0.4853   0.4027    0.3439
+ Gold          xQ     *0.4770   *0.4362   0.3531
monoT5                 0.5080    0.4331    0.3364
+ IntenT5       PM     0.5262    *0.4707   0.3530
+ Google Sug.   PM     0.5305    0.4598    0.3577
+ Gold          PM     *0.5421   0.4712    *0.3997
+ IntenT5       xQ     0.5180    *0.4648   0.3474
+ Google Sug.   xQ     0.5217    0.4509    *0.3714
+ Gold          xQ     0.5270    0.4657    *0.3762
ColBERT                0.4938    0.4241    0.3600
+ IntenT5       PM     *0.5484   *0.4934   0.3685
+ Google Sug.   PM     0.5067    0.4625    0.3968
+ Gold          PM     0.5254    0.4726    0.3635
+ IntenT5       xQ     *0.5440   *0.4868   0.3783
+ Google Sug.   xQ     0.5141    *0.4631   0.3857
+ Gold          xQ     0.5121    *0.4874   0.3669
(Count)                (105)     (96)      (99)

5.2 RQ2: Effect of Queries in Training Data

It is possible that the IntenT5 model simply memorizes the data that is present in the training dataset, rather than utilizing the language characteristics learned in the pre-training process to generalize to new queries. To investigate this, we stratify the dataset into three roughly equal-sized buckets representing the frequency with which the query appears in ORCAS. We use simple case-insensitive string matching, and count matches if they appear anywhere in the text (not just at the start of the text). We find that roughly one third of the WebTrack diversity queries either do not appear at all in ORCAS, or only appear once. For these queries, IntenT5 is forced to generalize. The next bucket (2–37 occurrences in ORCAS) contains roughly the next third of the WebTrack queries, and 38 or more occurrences forms the final bucket. We present the results of this experiment in Table 3. Here, we find that our IntenT5 model excels at cases where it either needs to generalize (first bucket) or where memorization is manageable (second bucket); in 11 of the 16 cases, IntenT5 scores higher than the Google suggestions. Further, IntenT5 boasts the overall highest effectiveness in both buckets: 0.5484 and 0.4934, respectively (for ColBERT + IntenT5 + PM2). In the final bucket, where there are numerous occurrences in the training data, IntenT5 never significantly outperforms the baseline system. Unsurprisingly, Google suggestions score higher than IntenT5 for these queries (6 out of 8 cases). Since frequent queries in ORCAS are likely frequent in general, Google suggestions can exploit frequency information from Google's vast interaction logs (which are absent from ORCAS).

To gain more insights into the effects of training data frequency, we qualitatively evaluate examples of generated intents. For instance, the query "gmat prep classes" (which only occurs once in ORCAS, as a verbatim match) generates intents such as "requirements", "registration", and "training". Although these are not perfect matches with the Gold intents (companies that offer these courses, practice exams, tips, similar tests, and two navigational intents), they are clearly preferable to the Google suggestions, which focus on specific locations (e.g., "near me", "online", "chicago", etc.), and they demonstrate the ability of the IntenT5 model to generalize. For the query "used car parts", which occurs in ORCAS 13 times, IntenT5 generates some of the queries found in ORCAS (e.g., "near me") but not others (e.g., "catalog"). For the query "toilets", which occurs 556 times in ORCAS, IntenT5 again generates some queries present in the training data (e.g., "reviews") and others that are not (e.g., "installation cost").

These results answer RQ2: IntenT5 effectively generalizes beyond what was seen in the training data. However, it can struggle with cases that occur frequently. This suggests that an ensemble approach may be beneficial, where intents for infrequent queries are generated from IntenT5, and intents for frequent queries are mined from interaction data (if available). We leave this to future work.
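To make the stratification used above concrete, a minimal sketch of the frequency counting and bucketing (our illustration):

    def orcas_frequency(query, orcas_queries):
        """Case-insensitive substring matching: count training queries that
        contain the topic's query text anywhere (not just at the start)."""
        q = query.lower()
        return sum(q in oq.lower() for oq in orcas_queries)

    def bucket(freq):
        """Thresholds approximating three even buckets over WT09-14."""
        if freq <= 1:
            return "0-1"
        if freq <= 37:
            return "2-37"
        return "38+"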
5.3 RQ3: Types of Under-specification

Recall that under-specified queries can be considered multi-faceted or ambiguous. To answer RQ3, we investigate the performance of IntenT5 on different types of queries, as indicated by the TREC labels. Note that WT13–14 also include a total of 49 queries that are fully specified ("single", e.g., "reviews of les miserables"). Table 4 provides these results. We find that IntenT5 excels at handling faceted queries, often yielding significant gains. When it comes to ambiguous queries, however, IntenT5 rarely significantly improves upon the baseline. Note that all intent strategies, including when using the Gold intents, struggle with ambiguous queries. However, we acknowledge that the ambiguous query set is rather small (only 62 queries). This could motivate the creation of a larger ambiguous web search ranking evaluation dataset in the future to allow further study of this interesting and challenging problem. Finally, we notice that IntenT5 can also improve the performance on fully-specified queries, most notably for Vanilla BERT and ColBERT, where the non-diversified models otherwise significantly underperform DPH. Curiously, we do not observe similar behavior for monoT5, suggesting that this behavior may depend on the underlying language model (BERT vs. T5).

Table 4: Diversification performance (α-nDCG@20) by query type. Significant differences from the unmodified system are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

System          Agg.   Faceted   Ambiguous   Single
DPH                    0.3804    0.2525      0.6399
+ IntenT5       PM     *0.4093   0.2576      0.6711
+ Google Sug.   PM     *0.4130   0.2629      0.6621
+ Gold          PM     *0.4626   0.2908      0.6399
+ IntenT5       xQ     0.3922    0.2433      0.6550
+ Google Sug.   xQ     *0.4070   0.2724      0.6553
+ Gold          xQ     *0.4565   0.2997      0.6399
Vanilla BERT           0.3809    0.2501      0.5323
+ IntenT5       PM     *0.4244   0.3087      *0.6415
+ Google Sug.   PM     *0.4149   0.2993      0.5353
+ Gold          PM     *0.4392   0.2831      0.5253
+ IntenT5       xQ     *0.4214   *0.2997     *0.6421
+ Google Sug.   xQ     0.4086    0.2970      0.5681
+ Gold          xQ     *0.4409   0.2850      0.5253
monoT5                 0.4184    0.3012      0.6172
+ IntenT5       PM     0.4445    0.3165      0.6429
+ Google Sug.   PM     0.4405    0.3342      0.6340
+ Gold          PM     *0.4802   0.3307      0.6177
+ IntenT5       xQ     0.4363    0.3214      0.6285
+ Google Sug.   xQ     *0.4385   *0.3327     0.6353
+ Gold          xQ     *0.4607   0.3182      0.6177
ColBERT                0.4237    0.3267      0.5655
+ IntenT5       PM     *0.4629   0.3246      *0.6848
+ Google Sug.   PM     0.4558    0.3198      0.6268
+ Gold          PM     *0.4740   0.3068      0.5655
+ IntenT5       xQ     *0.4596   0.3355      *0.6816
+ Google Sug.   xQ     *0.4536   0.3355      0.6102
+ Gold          xQ     *0.4775   0.3016      0.5655
(Count)                (189)     (62)        (49)

These results answer RQ3: IntenT5 improves the diversity of multi-faceted queries and even improves ColBERT's performance on fully-specified queries. However, like alternative approaches, it struggles to generate effective intents for ambiguous queries.

5.4 RQ4: Handling Ambiguous Queries

Given that ambiguous queries appear to be difficult to handle, we investigate two proposed approaches for overcoming this problem: Distributional Causal Language Modeling (DCLM, introduced in Section 3.2) and Representation Swapping (RS, introduced in Section 3.3). Since monoT5 and ColBERT most effectively use IntenT5 on ambiguous queries, we focus our investigation on these models. Table 5 presents the effectiveness of these approaches, stratified by query type. In general, we observe only marginal differences when using combinations of these approaches. The most effective combination for ambiguous queries (monoT5 + IntenT5 + DCLM + xQuAD) is not significantly more effective than monoT5 + IntenT5 + xQuAD.

Table 5: Diversification performance (α-nDCG@20) by query type when using distributional causal language modeling (DCLM) and representation swapping (RS). Statistically significant differences from the unmodified IntenT5 approach are indicated by * (paired t-test, Bonferroni correction, p < 0.05).

System               Agg.   Faceted   Ambiguous   Single
monoT5 + IntenT5     PM     0.4445    0.3165      0.6429
+ DCLM               PM     0.4424    0.3219      0.6628
+ RS                 PM     0.4481    0.3279      0.6297
+ DCLM + RS          PM     0.4457    0.3213      0.6173
monoT5 + IntenT5     xQ     0.4363    0.3214      0.6285
+ DCLM               xQ     0.4333    0.3421      0.6469
+ RS                 xQ     0.4341    0.3348      0.6404
+ DCLM + RS          xQ     0.4324    0.3129      0.6249
ColBERT + IntenT5    PM     0.4629    0.3246      0.6848
+ DCLM               PM     *0.4339   0.3255      0.6584
+ RS                 PM     0.4645    0.3185      0.6848
+ DCLM + RS          PM     0.4420    0.3174      0.6356
ColBERT + IntenT5    xQ     0.4596    0.3355      0.6816
+ DCLM               xQ     0.4469    0.3260      0.6795
+ RS                 xQ     0.4564    0.3263      0.6816
+ DCLM + RS          xQ     0.4478    0.3250      0.6795

Digging deeper into the queries generated by each approach, we find that there are indeed cases where the intents generated using DCLM and RS are substantially more diverse than those of the base IntenT5 model. The top intents generated for the query penguins by IntenT5 are meaning, history, habitat, information, and definition; in fact, all of the top 20 intents either relate to the animal (rather than the hockey team) or are very general. Meanwhile, DCLM overcomes many of the general intents, but the queries skew heavily toward the hockey team: schedule, website, wikipedia, highlights, and merchandise. This problem is addressed when applying both DCLM and RS, which generates: wikipedia, tickets, population, schedule, and website; it covers both senses. Despite the clear benefits for some queries, the approach can cause drift on other queries, and sometimes does not pick up on important intents. For instance, the intents generated for the query iron with IntenT5 + DCLM + RS focus heavily on the nutrient sense, and do not identify the element or appliance senses.

To answer RQ4, although approaches like DCLM and RS can improve diversity in isolated cases, there is insufficient evidence that these approaches improve ranking diversity overall. We also find no significant differences in effectiveness between the DCLM and RS approaches.

6 ANALYSIS

One advantage of performing search result diversification explicitly is that the generated intents are expressed in natural language and can be interpreted. In Table 6, we present the top 5 intents generated by our models, as well as the top query suggestions from Google.
Table 6: Top intents generated using our IntenT5 model and Google search suggestions.

Query                IntenT5              IntenT5 + DCLM + RS   Google Suggestions
penguins             meaning              wikipedia             of madagascar
                     history              tickets               schedule
                     habitat              population            score
                     information          schedule              hockey
                     definition           website               game
mitchell college     football             tuition               baseball
                     meaning              football              covid vaccine
                     address              athletics             athletics
                     basketball           admissions            basketball
                     website              bookstore             of business
wendleton college    address              tuition               (none)
                     football             athletics
                     website              bookstore
                     tuition              faculty
                     application          address
electoral college    meaning              wikipedia             map
                     definition           meaning               definition
                     florida              definition            map 2020
                     history              articles              definition government
                     michigan             election              college votes
solar panels         meaning              installation          for sale
                     explained            installed             for home
                     calculator           installation cost     cost
                     installation         on sale               for rv
                     home depot           for home              for house
condos in florida    for sale             rentals               on the beach
                     meaning              beachfront            for rent
                     near me              for sale              on the beach for sale
                     reviews              near me               keys
                     by owner             reservations          keys for sale
condos in new york   meaning              for sale              for rent
                     near me              to rent               zillow
                     chicago              address               manhattan
                     florida              weather               state
                     for sale             nyc                   ny

For the running example of penguins, we see that Google identifies two senses (an animated film and the hockey team), while our model can identify the animal and the hockey team. For the query mitchell college, our model identifies several salient facets, as do the Google search suggestions. Note that this is not due to memorization; the only queries with the text mitchell college in the training collection are william mitchell college of law and william mitchell college of law ranking. This quality is appealing because it shows that the model is capable of generalizing beyond its training data. On the other hand, our model can be prone to constructing information, such as for the (fictitious) wendleton college. We see that these generalizations are not baked entirely into the prompt of college, however, given that the prefix electoral college (a process of the United States government) does not generate similar queries. These results provide qualitative evidence for our observations in Section 5.2; IntenT5 is able to effectively generalize beyond what is seen in the training data. However, we acknowledge that this quality may be undesirable in some circumstances. For the query solar panels, we see that our model can generate multi-word intents (which can be beneficial to neural models [16]), but can sometimes get stuck on common prefixes (e.g., "install"). We also find that our model can struggle to provide valuable recommendations based on a specified location. Although IntenT5 with DCLM and RS can predict salient intents like beachfront and nyc for the queries condos in florida and condos in new york, respectively, the base IntenT5 model relies primarily on generic intents, or even suggests alternative locations. Meanwhile, the Google suggestions consistently provide location-specific intents. Overall, this analysis shows that IntenT5 generates intents that exhibit awareness of the query at hand, and that the DCLM and RS approaches can change the output of the model substantially. The intents are often comparable with those provided by a commercial search engine from interaction data.

7 CONCLUSIONS

We presented IntenT5, a new approach for generating potential query intents for explicit search result diversification. Across the TREC WebTrack 2009–2014 datasets (300 test queries in total), we found that this approach can significantly outperform other sources of query intents when used with unsupervised search result diversification algorithms and neural re-rankers, such as Vanilla BERT and ColBERT. Specifically, we observed up to a 15% relative improvement over the query suggestions provided by Google (as measured by NRBP, when re-ranking with Vanilla BERT and aggregating with PM2). The proposed approach significantly improves the performance on multi-faceted queries and can even overcome shortcomings on fully-specified queries. We found that IntenT5 has difficulty handling ambiguous queries, and proposed two approaches for overcoming these ambiguities. Although we observed that these approaches can qualitatively improve the generated intents, we found insufficient evidence that these modifications are beneficial in aggregate. This motivates the creation of a larger and more extensive dataset of ambiguous queries for a future study. Importantly, we observed that our approach can generalize to queries that were not present in the training data. As the first work to investigate the use of contextualized language models in the context of search result diversification, we have laid the groundwork for investigating ongoing challenges, such as handling query terms with multiple senses.

ACKNOWLEDGMENTS

This work has been supported by EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.

REFERENCES

[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In WSDM.
[2] Giambattista Amati, Giuseppe Amodeo, Marco Bianchi, Carlo Gaibisso, and Giorgio Gambosi. 2008. FUB, IASI-CNR and University of Tor Vergata at TREC 2008 Blog Track. In TREC.
[3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv abs/1611.09268 (2016).
[4] Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In SIGIR.
[5] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In CIKM.
[6] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv abs/2003.10555 (2020).
[7] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In SIGIR.
[8] Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. 2009. Overview of the TREC 2009 Web Track. In TREC.
[9] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Gordon V. Cormack. 2010. Overview of the TREC 2010 Web Track. In TREC.
[10] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. 2011. Overview of the TREC 2011 Web Track. In TREC.
[11] Charles L. A. Clarke, Nick Craswell, and Ellen M. Voorhees. 2012. Overview of the TREC 2012 Web Track. In TREC.
[12] Charles L. A. Clarke, Maheedhar Kolla, and Olga Vechtomova. 2009. An Effectiveness Measure for Ambiguous and Underspecified Queries. In ICTIR.
[13] Kevyn Collins-Thompson, Paul Bennett, Fernando Diaz, Charles L. A. Clarke, and Ellen M. Voorhees. 2013. TREC 2013 Web Track Overview. In TREC.
[14] Kevyn Collins-Thompson, Craig Macdonald, Paul Bennett, Fernando Diaz, and Ellen M. Voorhees. 2014. TREC 2014 Web Track Overview. In TREC.
[15] Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, and Bodo Billerbeck. 2020. ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search. arXiv preprint arXiv:2006.05324 (2020).
[16] Zhuyun Dai and Jamie Callan. 2019. Deeper Text Understanding for IR with Contextual Neural Language Modeling. In SIGIR.
[17] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. In ACL.
[18] Van Dang and W. Bruce Croft. 2012. Diversity by proportionality: an election-based approach to search result diversification. In SIGIR.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[20] Yue Feng, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2018. From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks. In SIGIR.
[21] Yue Feng, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2018. From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks. In SIGIR.
[22] Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2020. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. In ECAI.
[23] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (2002).
[24] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the Limits of Language Modeling. ArXiv abs/1602.02410 (2016).
[25] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In SIGIR.
[26] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
[27] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021. Pretrained Transformers for Text Ranking: BERT and Beyond. ArXiv abs/2010.06467 (2021).
[28] Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. In EMNLP.
[29] Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. 2020. ABNIRML: Analyzing the Behavior of Neural IR Models. ArXiv abs/2011.00696 (2020).
[30] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized Embeddings for Document Ranking. In SIGIR.
[31] Joel Mackenzie, J. Shane Culpepper, Roi Blanco, Matt Crane, Charles L. A. Clarke, and Jimmy Lin. 2018. Query Driven Algorithm Selection in Early Stage Retrieval. In WSDM.
[32] Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. If Beam Search Is the Answer, What Was the Question?. In EMNLP.
[33] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
[34] Bhaskar Mitra and Nick Craswell. 2018. An Introduction to Neural Information Retrieval. Found. Trends Inf. Retr. 13 (2018).
[35] Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27 (2008).
[36] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. ArXiv abs/1901.04085 (2019).
[37] Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of EMNLP.
[38] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery.pdf
[39] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. ArXiv abs/1904.08375 (2019).
[40] Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. ArXiv abs/2101.05667 (2021).
[41] Filip Radlinski and Nick Craswell. 2017. A Theoretical Framework for Conversational Search. In CHIIR.
[42] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv abs/1910.10683 (2020).
[43] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Exploiting query reformulations for web search result diversification. In WWW.
[44] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search Result Diversification. Found. Trends Inf. Retr. 9 (2015).
[45] Danny Sullivan. 2018. How Google autocomplete works in Search. https://blog.google/products/search/how-google-autocomplete-works-search/
[46] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Q. Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. ArXiv abs/1610.02424 (2016).
[47] Esteban Walker and Amy S. Nowacki. 2011. Understanding equivalence and noninferiority testing. Journal of General Internal Medicine 26, 2 (2011).
[48] Jun Wang and Jianhan Zhu. 2009. Portfolio Theory of Information Retrieval. In SIGIR.
[49] Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov Decision Process for Search Result Diversification. In SIGIR.
[50] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, J. Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. ArXiv abs/2007.00808 (2020).
[51] Sevgi Yigit-Sert, Ismail Sengor Altingovde, Craig Macdonald, Iadh Ounis, and Özgür Ulusoy. 2020. Supervised approaches for explicit search result diversification. Information Processing & Management 57, 6 (2020).
[52] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In CIKM.