TEMPER: A Temporal Relevance Feedback Method
Mostafa Keikha, Shima Gerani and Fabio Crestani
{mostafa.keikha, shima.gerani, fabio.crestani}@usi.ch
University of Lugano, Lugano, Switzerland

Abstract. The goal of a blog distillation (blog feed search) method is to rank blogs according to their recurrent relevance to the query. An interesting property of blog distillation that differentiates it from traditional retrieval tasks is its dependency on time. In this paper we investigate the effect of time dependency on query expansion. We propose a framework, TEMPER, which selects different expansion terms for different times and ranks blogs according to their relevance to the query over time. By generating multiple expanded queries based on time, we are able to capture the dynamics of the topic both in its aspects and in its vocabulary usage. We show performance gains over baseline techniques that generate a single expanded query from the top retrieved posts or blogs irrespective of time.

1 Introduction

User generated content is growing very fast and is becoming one of the most important sources of information on the Web. Blogs are one of the main sources of information in this category: millions of people write about their experiences and express their opinions in blogs every day.

Considering this huge amount of user generated data and its specific properties, new retrieval methods are needed to address the different types of information needs that blog users may have. Users' information needs in the blogosphere differ from those of general Web users. Mishne and de Rijke [1] analyzed a blog query log and divided blog queries into two broad categories, context and concept queries. In context queries, users look for the contexts of blogs in which a named entity occurred, to find out what bloggers say about it; in concept queries, they look for blogs that deal with one of the searcher's topics of interest.

In this paper we focus on the blog distillation task (also known as blog feed search)¹, where the goal is to answer topics of the second category [2]. Blog distillation is concerned with ranking blogs according to their recurring central interest in the topic of a user's query. In other words, our aim is to discover relevant blogs for each topic² that a user can add to his feed reader and read in the future [3].

¹ In this paper we use the words "feed" and "blog" interchangeably.
² In this paper we use the words "topic" and "query" interchangeably.
An important aspect of blog distillation that differentiates it from other IR tasks is the temporal nature of blogs and topics. Distillation topics are often multifaceted and can be discussed from different perspectives [4]. Vocabulary usage in the documents relevant to a topic can change over time as different aspects (or sub-topics) of the query are expressed. These dynamics can create a term mismatch problem over time, such that a query term may not be a good indicator of the query topic in every time interval. To address this problem, we propose a time-based query expansion method that expands queries with different terms at different times. This contrasts with previous query expansion methods in blog search, which generate only a single expanded query [5, 4]. Our experiments on different test collections with different baseline methods indicate that time-based query expansion is effective in improving retrieval performance and can outperform existing techniques.

The rest of the paper is organized as follows. In Section 2 we review state-of-the-art methods in blog retrieval. Section 3 describes existing query expansion methods for blog retrieval in more detail. Section 4 explains our time-based query expansion approach. Section 5 presents the experimental setup, and experimental results over different blog data sets are discussed in Section 6. Finally, we conclude the paper and describe future work in Section 7.

2 Related Work

The main research on blog distillation started after 2007, when the TREC organizers introduced this task in the blog track [3]. Researchers have applied methods from areas that are similar to blog distillation, such as ad-hoc search, expert search and resource selection in distributed information retrieval.

The simplest models use ad-hoc search methods to find blogs relevant to a specific topic. They treat each blog as one long document created by concatenating all of its posts together [6, 4, 7]. These methods ignore the specific properties of blogs and mostly use standard IR techniques to rank them. Despite their simplicity, these methods perform fairly well in blog retrieval.

Other approaches adapt expert search methods to blog retrieval [8, 2]. In these models, each post in a blog is seen as evidence that the blog has an interest in the query topic. In [2], Macdonald et al. use data fusion models to combine this evidence and compute a final relevance score for the blog, while Balog et al. adapt two language modeling approaches from expert finding and show their effectiveness in blog distillation [8].

Resource selection methods from distributed information retrieval have also been applied to blog retrieval [4, 9, 7]. Elsas et al. treat blog distillation as a resource selection problem [4, 9]: they model each blog as a collection of posts and use a language modeling approach to select the best collection. A similar approach is proposed by Seo and Croft [7], which they call Pseudo Cluster Selection. They create topic-based clusters of posts in each blog and select the blogs whose clusters are most similar to the query.
Temporal properties of posts have been considered in different ways in blog retrieval. Nunes et al. define two measures, "temporal span" and "temporal dispersion", to evaluate "how long" and "how frequently" a blog has been writing about a topic [10]. Similarly, Macdonald and Ounis [2] use a heuristic measure to capture the recurring interests of blogs over time. Other approaches give higher scores to more recent posts before aggregating them [11, 12]. These methods and the improvements they report show the importance and usefulness of temporal information in blog retrieval. However, none of them investigates the effect of time on the vocabulary used for a topic. We employ temporal information as a source for distinguishing between the different aspects of a topic and the terms used for each aspect. This leads us to a time-based query expansion method in which we generate multiple expanded queries to cover the multiple aspects of a topic over time.

Different query expansion strategies for blog retrieval have been explored by Elsas et al. [4] and Lee et al. [5]. Since we use these methods as our baselines, we discuss them in more detail in the next section.

3 Query Expansion in Blog Retrieval

Query expansion is known to be effective in improving the performance of retrieval systems [13-15]. The general idea is to add terms to an initial query in order to disambiguate it and reduce the potential term mismatch between the query and the relevant documents. Automatic query expansion techniques usually assume that the top retrieved documents are relevant to the topic and use their content to generate an expanded query. In some situations it has been shown that multiple expanded queries work better than the usual single expanded query, for example in the server-based query expansion technique in distributed information retrieval [16].

An expanded query, while remaining relevant to the original query, should cover as many aspects of the query as possible. If the expanded query is too specific to one aspect of the original query, part of the relevant documents will be missed in the re-ranking phase. In the blog search context, where queries are more general than typical web search queries [4], the coverage of the expanded query is even more important. Under these conditions it can be better to have multiple queries, each covering a different aspect of a general query.

Elsas et al. made the first investigation of query expansion techniques for blog search [4]. They show that standard feedback methods (selecting the new terms from the top retrieved posts or the top retrieved blogs) with the usual parameter settings are not effective in blog retrieval. However, they show that expanding the query using an external resource such as Wikipedia can improve the performance of the system. In more recent work, Lee et al. [5] propose new methods for selecting appropriate posts as the source of expansion and show that these methods can be effective. All these methods can be summarized as follows:
– Top Feeds: uses all the posts of the top retrieved feeds for query expansion. This model has two parameters: the number of selected feeds and the number of terms in the expanded query [4].
– Top Posts: uses the top retrieved posts for query expansion. The number of selected posts and the number of expansion terms are the parameters of this model [4].
– FFBS: uses the top posts in the top retrieved feeds as the source for selecting new terms. The number of posts selected from each feed is fixed across feeds. This model has three parameters: the number of selected feeds, the number of selected posts per feed and the number of expansion terms [5].
– WFBS: works like FFBS, except that the number of posts selected from each feed depends on the feed's rank in the initial list, so that more relevant feeds contribute more to the new query. Like FFBS, WFBS has three parameters: the number of selected feeds, the total number of posts used for expansion and the number of expansion terms [5].

Among these methods, "Top Feeds" risks expanding the query with non-relevant terms, because not all the posts in a top retrieved feed are necessarily relevant to the topic. On the other hand, "Top Posts" might not cover all the sub-topics of the query, because the top retrieved posts might be mainly relevant to some dominant aspect of the query. FFBS and WFBS were proposed to achieve more coverage than "Top Posts" while selecting more relevant terms than "Top Feeds" [5]. However, since it is difficult to summarize all the aspects of a topic in one single expanded query, these methods still do not achieve the maximum possible coverage.

4 TEMPER

In this section we describe TEMPER, our novel framework for time-based relevance feedback in blog distillation. TEMPER assumes that posts at different times talk about different aspects (sub-topics) of a general topic. Vocabulary usage for the topic is therefore time-dependent, and this dependency can be exploited in a relevance feedback method. Following this intuition, TEMPER selects time-dependent terms for query expansion and generates one expanded query for each time point. The TEMPER framework consists of three steps (a toy sketch of the full pipeline follows the list):

1. Time-based representation of blogs and queries.
2. Time-based similarity between a blog and a query.
3. Ranking blogs according to their overall similarity to the query.

In the remainder of this section, we describe our approach to each of these steps.
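To make the pipeline concrete before detailing each step, here is a toy, self-contained sketch in Python. All names, the toy data, and the simplified overlap similarity are our own illustrative choices, not the paper's implementation; Sections 4.1 and 4.2 define the actual kernel-smoothed representations and the daily cosine similarity.

```python
# Toy end-to-end sketch of TEMPER's three steps (illustrative only).
# Section 4.1 replaces daily_rep with kernel-smoothed, KL-selected
# representations; Section 4.2 replaces the overlap with a daily cosine.

from collections import Counter

T = 5  # time span of the toy collection, in days

def daily_rep(posts):
    """Step 1: map each day to the term counts of that day's posts."""
    rep = {i: Counter() for i in range(T)}
    for day, tokens in posts:
        rep[day].update(tokens)
    return rep

def daily_sim(b, q, i):
    """Step 2 (simplified): term overlap at day i instead of cosine."""
    return sum((b[i] & q[i]).values())

blog = daily_rep([(0, ["nba", "finals"]), (2, ["nba", "draft"])])
query = daily_rep([(0, ["nba", "finals"]), (1, ["nba"]), (2, ["draft"])])

# Step 3: score the blog by summing its daily similarities (cf. Eq. 4).
print(sum(daily_sim(blog, query, i) for i in range(T)))  # -> 3
```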
4.1 Time-Based Representation of Blogs and Queries

Initial Representation of Blogs and Queries. To consider time in the TEMPER framework, we first need to represent blogs and queries in the time space. To represent a blog, we distribute its posts based on their publication date; in order to have a daily representation, we concatenate all the posts published on the same day. To represent a query, we take advantage of the top retrieved posts for that query: as for blogs, we select the top K relevant posts and divide them by publication date, concatenating posts with the same date. In order to have a more informative representation of the query, we then select the top N terms for each day using the KL-divergence between the term distribution of the day and that of the whole collection [17].

Note that in this initial representation there can be days with no associated term distribution. However, in order to calculate the relevance of a blog to a query, TEMPER needs representations of the blog and the query on all days. We therefore use the information available in the initial representation to estimate term distributions for the remaining days, as explained next.

Term Distributions Over Time. TEMPER generates a representation of each topic or blog for each day, based on the idea that a term at one time position propagates its count to the other time positions through a proximity-based density function. In this way we obtain a virtual document for a blog/topic at each time position. The term frequencies of such a document are calculated as follows:

    tf'(t, d, i) = \sum_{j=1}^{T} tf(t, d, j) K(i, j)     (1)

where i and j indicate time positions (days) and T denotes the time span of the collection in days. tf'(t, d, i) is the term frequency of term t in blog/topic d at day i, calculated from the frequency of t on all days. K(i, j) decreases as the distance between i and j increases and can be computed using the kernel functions described below. This representation of a document in the time space is similar to proximity-based methods that generate a virtual document at each position within a document in order to capture the proximity of words [18, 19]; here, however, we aim to capture the temporal proximity of terms. In this paper we employ the Laplace kernel function, which has been shown to be effective in previous work [19], together with the Rectangular (square) kernel function.
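As a minimal sketch of Eq. (1), assuming posts arrive as (day index, token list) pairs and with the kernel passed in as a function; the data layout and names are ours:

```python
from collections import Counter, defaultdict

def build_daily_counts(posts):
    """Group posts by day and count terms; posts are (day, tokens) pairs."""
    daily = defaultdict(Counter)
    for day, tokens in posts:
        daily[day].update(tokens)
    return daily

def propagate(daily_counts, T, kernel):
    """Eq. (1): tf'(t, d, i) = sum_j tf(t, d, j) * K(i, j).

    Spreads each day's term counts over the whole time span [0, T) so that
    every day ends up with a (possibly fractional) term distribution."""
    smoothed = {i: Counter() for i in range(T)}
    for j, counts in daily_counts.items():
        for i in range(T):
            w = kernel(i, j)
            if w > 0:
                for term, tf in counts.items():
                    smoothed[i][term] += tf * w
    return smoothed
```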
In the following, we present the normalized kernel functions together with their corresponding variances.

1. Laplace kernel:

    k(i, j) = \frac{1}{2b} \exp\left(-\frac{|i - j|}{b}\right)     (2)

   where \sigma^2 = 2b^2.

2. Rectangular kernel:

    k(i, j) = \begin{cases} \frac{1}{2a} & \text{if } |i - j| \le a \\ 0 & \text{otherwise} \end{cases}     (3)

   where \sigma^2 = a^2 / 3.

4.2 Time-Based Similarity Measure

Given the daily representations of queries and blogs, we can calculate the daily similarity between the two representations, yielding a daily similarity vector for each blog-query pair. The final similarity between the blog and the query is then calculated by summing the daily similarities:

    sim_{temporal}(B, Q) = \sum_{i=1}^{T} sim(B, Q, i)     (4)

where sim(B, Q, i) is the similarity between the blog and query representations at day i and T is the time span of the collection in days. Another popular method in time series similarity calculation is to treat each time point as one dimension of the time space and use the Euclidean length of the daily similarity vector as the final similarity between the two representations [20]:

    sim_{temporal}(B, Q) = \sqrt{\sum_{i=1}^{T} sim(B, Q, i)^2}     (5)

We use the cosine similarity as a simple and effective measure of the similarity between the blog and topic representations at a specific day i:

    sim(B, Q, i) = \frac{\sum_w tf(w, B, i) \times tf(w, Q, i)}{\sqrt{\sum_w tf(w, B, i)^2 \times \sum_w tf(w, Q, i)^2}}     (6)

The temporal similarity, normalized over all blogs, is then used as P_{temporal}:

    P_{temporal}(B|Q) = \frac{sim_{temporal}(B, Q)}{\sum_{B'} sim_{temporal}(B', Q)}     (7)
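The kernels of Eqs. (2)-(3) and the similarities of Eqs. (4)-(6) can be sketched as follows, reusing the day-to-Counter layout from the previous listing; the aggregation mode names are our own:

```python
import math

def laplace_kernel(b):
    # Eq. (2): normalized Laplace kernel; variance sigma^2 = 2 * b^2.
    return lambda i, j: math.exp(-abs(i - j) / b) / (2 * b)

def rectangular_kernel(a):
    # Eq. (3): normalized rectangular kernel; variance sigma^2 = a^2 / 3.
    return lambda i, j: 1.0 / (2 * a) if abs(i - j) <= a else 0.0

def cosine(u, v):
    # Eq. (6): cosine similarity between two term-count mappings.
    dot = sum(tf * v.get(w, 0.0) for w, tf in u.items())
    nu = math.sqrt(sum(tf * tf for tf in u.values()))
    nv = math.sqrt(sum(tf * tf for tf in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def temporal_similarity(blog_rep, query_rep, T, mode="sum"):
    # Eqs. (4)/(5): aggregate the daily similarity vector either by
    # summing or by taking its Euclidean length.
    daily = [cosine(blog_rep[i], query_rep[i]) for i in range(T)]
    if mode == "euclidean":
        return math.sqrt(sum(s * s for s in daily))
    return sum(daily)
```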
Finally, in order to take advantage of all the available evidence about blog relevance, we interpolate the temporal score of a blog with its initial relevance score:

    P(B|Q) = \alpha P_{initial}(B|Q) + (1 - \alpha) P_{temporal}(B|Q)     (8)

where \alpha is a parameter that controls how much temporal relevance is considered in the model. We use the Blogger Model [8] for the initial ranking of the blogs. The only difference from the original Blogger Model is that we set the prior of a blog proportional to the logarithm of its number of posts, as opposed to the uniform prior used originally. This log-based prior was shown to be effective by Elsas et al. [4].

Table 1. Effect of cleaning the data set on Blogger Model. Statistically significant improvements at the 0.05 level are indicated by †.

  Model         Cleaned   MAP      P@10     Bpref
  BloggerModel  No        0.2432   0.3513   0.2620
  BloggerModel  Yes       0.2774†  0.4154†  0.2906†

5 Experimental Setup

In this section we explain the experimental setup used to evaluate the effectiveness of the proposed framework.

Collection and Topics. We conduct our experiments over three years' worth of TREC blog track data from the blog distillation task: the TREC'07, TREC'08 and TREC'09 data sets. The TREC'07 and TREC'08 data sets include 45 and 50 assessed queries respectively and use the Blogs06 collection. The TREC'09 data set uses Blogs08, a newer collection of blogs, and has 39 new queries³. We use only the title of each topic as the query.

The Blogs06 collection is a crawl of about one hundred thousand blogs over an 11-week period [22]; it includes the blog posts (permalinks), the feed, and the homepage of each blog. Blogs08 is a collection of about one million blogs crawled over a year, with the same structure as Blogs06 [21]. In our experiments we use only the permalink component of the collections, which consists of approximately 3.2 million documents for Blogs06 and about 28.4 million documents for Blogs08. We use the Terrier Information Retrieval system⁴ to index the collection with the default stemming and stopword removal. The language modeling approach with Dirichlet smoothing is used to score the posts and retrieve the top posts for each query.

³ Initially there were 50 queries in the TREC 2009 data set, but some of them had no relevant blogs for the selected facets and were removed from the official query set [21]. We do not make use of facets in this paper, but we use the official query set in order to be comparable with the TREC results.
⁴ http://ir.dcs.gla.ac.uk/terrier/
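For reference, Dirichlet-smoothed query likelihood scores a post by smoothing its term frequencies with collection statistics. A minimal sketch follows, assuming simple in-memory statistics rather than Terrier's actual index structures; the default mu is a common choice of ours, not a value reported in the paper:

```python
import math

def dirichlet_score(query_terms, post_counts, post_len, coll_prob, mu=2500.0):
    """Dirichlet-smoothed query log-likelihood of a post.

    post_counts: term -> frequency in the post; coll_prob: term ->
    probability in the whole collection (with a tiny floor for unseen
    terms). mu = 2500 is a common default, not the paper's setting."""
    score = 0.0
    for t in query_terms:
        p = (post_counts.get(t, 0) + mu * coll_prob.get(t, 1e-9)) / (post_len + mu)
        score += math.log(p)
    return score
```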
Table 2. Evaluation results for the implemented models over the TREC09 data set.

  Model                          MAP        P@10     Bpref
  BloggerModel                   0.2774     0.4154   0.2906
  TopFeeds                       0.2735     0.3897   0.2848
  TopPosts                       0.2892     0.4230   0.3057
  FFBS                           0.2848     0.4128   0.3009
  WFBS                           0.2895     0.4077   0.3032
  TEMPER-Rectangular-Sum         0.2967†    0.4128   0.3116†
  TEMPER-Rectangular-Euclidean   0.3014†‡∗  0.4435∗  0.3203†‡∗
  TEMPER-Laplace-Sum             0.3086†    0.4256   0.3295†
  TEMPER-Laplace-Euclidean       0.3122†‡∗  0.4307   0.3281†∗

Retrieval Baselines. We apply our feedback methods to the results of the Blogger Model [8]. Blogger Model is therefore the first baseline against which we compare the performance of our proposed methods. The second set of baselines consists of the query expansion methods proposed in previous work [4, 5]. For a fair comparison, we implemented these query expansion methods on top of Blogger Model and tuned their parameters using 10-fold cross-validation to maximize MAP. The last set of baselines is provided by the TREC organizers as part of the blog facet distillation task; we use these baselines to assess the effect of TEMPER when re-ranking the results of other retrieval systems.

Evaluation. We use the blog distillation relevance judgements provided by TREC for evaluation. We report Mean Average Precision (MAP), binary preference (bPref) and precision at 10 documents (P@10). Throughout our experiments we use the Wilcoxon signed-rank matched-pairs test at the 0.05 significance level to test for statistically significant improvements.

6 Experimental Results

In this section we present the experiments conducted to evaluate the usefulness of the proposed method. We focus mainly on the TREC09 data set, as it is the most recent and has enough temporal information, which is important for our analysis. However, in order to see the effect of the method on the smaller collections, we also briefly report the final results on the TREC07 and TREC08 data sets.

Table 1 shows the evaluation results of Blogger Model on the TREC09 data set. Because blog data is highly noisy, we carry out a cleaning step on the collection in order to improve the overall performance of the system, using the cleaning method proposed by Parapar et al. [23]. As Table 1 shows, cleaning the collection is very useful and improves the MAP of the system by about 14%. The results of Blogger Model on the cleaned data are already better than the best TREC09 submission on title-only queries.
Table 3. Evaluation results for the implemented models over the TREC08 data set.

  Model                      MAP        P@10       Bpref
  BloggerModel               0.2453     0.4040     0.2979
  TopPosts                   0.2567     0.4080     0.3090
  WFBS                       0.2546     0.3860     0.3087
  TEMPER-Laplace-Euclidean   0.2727†‡∗  0.4380†‡∗  0.3302†∗

Table 4. Evaluation results for the implemented models over the TREC07 data set.

  Model                      MAP      P@10      Bpref
  BloggerModel               0.3354   0.4956    0.3818
  TopPosts                   0.3524†  0.5044    0.3910
  WFBS                       0.3542†  0.5356†‡  0.3980
  TEMPER-Laplace-Euclidean   0.3562†  0.5111    0.4011

Table 2 summarizes the retrieval performance of Blogger Model and the baseline query expansion methods, along with different settings of TEMPER, on the TREC 2009 data set. The best value in each column is in bold face. A dagger (†), a double dagger (‡) and a star (∗) indicate statistically significant improvement over Blogger Model, TopPosts and WFBS respectively. As can be seen from the table, none of the query expansion baselines improves the underlying Blogger Model significantly.

From Table 2 we can also see that TEMPER in all its settings (rectangular/Laplace kernel, sum/Euclidean similarity) improves significantly over Blogger Model and the query expansion methods. These results show the effectiveness of the time-based representation of blogs and queries and highlight the importance of the time-based similarity calculation between blogs and topics.

Tables 3 and 4 present the corresponding results over the TREC08 and TREC07 data sets. On TREC08, TEMPER improves significantly over Blogger Model and the query expansion methods. On TREC07, TEMPER improves significantly over Blogger Model, but its performance is comparable to that of the other query expansion methods and the differences are not statistically significant.

As mentioned in Section 5, we also consider the three standard baselines provided by the TREC 2010 organizers in order to see the effect of our feedback method on retrieval baselines other than Blogger Model. Table 8 shows the results of TEMPER over the TREC baselines. TEMPER improves the baselines in most cases; the only baseline that TEMPER does not improve significantly is stdBaseline1⁵.

⁵ Note that the stdBaselines are used as a black box; we are not yet aware of their underlying methods.

Tables 5, 6 and 7 compare the performance of TEMPER with the best title-only TREC runs of 2009, 2008 and 2007 respectively. TEMPER performs better than the best TREC runs on the TREC09 data set. The results on TREC08 and TREC07 are comparable
to the best TREC runs, and can be considered the third and second best reported results over the TREC08 and TREC07 data sets respectively.

Table 5. Comparison with the best TREC09 title-only submissions.

  Model                          MAP      P@10     Bpref
  TEMPER-Laplace-Euclidean       0.3122   0.4307   0.3281
  TREC09-rank1 (buptpris 2009)   0.2756   0.3206   0.2767
  TREC09-rank2 (ICTNET)          0.2399   0.3513   0.2384
  TREC09-rank3 (USI)             0.2326   0.3308   0.2409

Table 6. Comparison with the best TREC08 title-only submissions.

  Model                        MAP      P@10     Bpref
  TEMPER-Laplace-Euclidean     0.2727   0.4380   0.3302
  TREC08-rank2 (CMU-LTI-DIR)   0.3056   0.4340   0.3535
  TREC08-rank1 (KLE)           0.3015   0.4480   0.3580
  TREC08-rank3 (UAms)          0.2638   0.4200   0.3024

TEMPER has four parameters: the number of posts selected for expansion, the number of terms selected for each day, the standard deviation (σ) of the kernel function, and the weight α of the initial ranking score. Among these, we fix the number of terms per day at 50, following previous work [4]. The standard deviation of the kernel function is estimated from the top retrieved posts of each query: since the goal of the kernel function is to model the distribution of the distance between two consecutive relevant posts, we take the distances between the selected (top retrieved) posts as samples of this distribution and use their standard deviation as an estimate of σ; a small sketch of this estimate follows below. The two remaining parameters are tuned using 10-fold cross-validation. Figures 1 and 2 show the sensitivity of the system to these two parameters: the best performance is obtained by selecting about 150 posts for expansion, while any number above 50 gives reasonable results. The value of α depends on the underlying retrieval model; TEMPER outperforms Blogger Model for all values of α, with the best value around 0.1.
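A small sketch of this σ estimate, assuming the top retrieved posts are reduced to a list of day indices; the function name and the degenerate-case fallback are our own:

```python
import statistics

def estimate_sigma(post_days):
    """Sample standard deviation of the gaps between consecutive
    top retrieved posts, used as the kernel's sigma."""
    days = sorted(post_days)
    gaps = [b - a for a, b in zip(days, days[1:])]
    if len(gaps) < 2:
        return 1.0  # fallback for queries with too few posts (our choice)
    return statistics.stdev(gaps)

# The kernel bandwidths then follow from the variance formulas:
#   Laplace:     sigma^2 = 2 b^2    =>  b = sigma / sqrt(2)
#   Rectangular: sigma^2 = a^2 / 3  =>  a = sigma * sqrt(3)
```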
Table 7. Comparison with the best TREC07 title-only submissions.

  Model                      MAP      P@10     Bpref
  TEMPER-Laplace-Euclidean   0.3562†  0.5111   0.4011
  TREC07-rank1 (CMU)         0.3695   0.5356   0.3861
  TREC07-rank2 (UGlasgow)    0.2923   0.5311   0.3210
  TREC07-rank3 (UMass)       0.2529   0.5111   0.2902

Table 8. Evaluation results for the standard baselines on the TREC09 data set. Statistically significant improvements are indicated by †.

  Model                 MAP      P@10     Bpref
  stdBaseline1          0.4066   0.5436   0.4150
  TEMPER-stdBaseline1   0.4114   0.5359   0.4182
  stdBaseline2          0.2739   0.4103   0.2845
  TEMPER-stdBaseline2   0.3009†  0.4308†  0.3158†
  stdBaseline3          0.2057   0.3308   0.2259
  TEMPER-stdBaseline3   0.2493†  0.4026†  0.2821†

7 Conclusion and Future Work

In this paper we investigated blog distillation, where the goal is to rank blogs according to their recurrent relevance to the topic of a query. We focused on the temporal properties of blogs and their application to query expansion for blog retrieval. Following the intuition that the term distribution of a topic may change over time, we proposed a time-based query expansion technique. We showed that it is effective to build multiple expanded queries for different time points and to score the posts of each time using the corresponding expanded query. Our experiments on different blog collections with different baseline methods showed that this approach can improve on state-of-the-art query expansion techniques.

Future work will involve further analysis of the temporal properties of blogs and topics. In particular, modeling the evolution of topics over time can help us better estimate topic relevance models. Such modeling can be seen as a temporal relevance model, an unexplored problem in blog retrieval.

8 Acknowledgement

This work was supported by the Swiss National Science Foundation (SNSF) under the XMI project (Project Nr. 200021-117994/1).

References

1. Mishne, G., de Rijke, M.: A study of blog search. In: Proceedings of ECIR 2006. (2006) 289–301
2. Macdonald, C., Ounis, I.: Key blog distillation: ranking aggregates. In: Proceedings of CIKM 2008. (2008) 1043–1052
3. Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC-2007 Blog Track. In: Proceedings of TREC 2007. (2008)
4. Elsas, J.L., Arguello, J., Callan, J., Carbonell, J.G.: Retrieval and feedback models for blog feed search. In: Proceedings of SIGIR 2008. (2008) 347–354
5. Lee, Y., Na, S.H., Lee, J.H.: An improved feedback approach using relevant local posts for blog feed retrieval. In: Proceedings of CIKM 2009. (2009) 1971–1974
6. Efron, M., Turnbull, D., Ovalle, C.: University of Texas School of Information at TREC 2007. In: Proceedings of TREC 2007. (2008)
7. Seo, J., Croft, W.B.: Blog site search using resource selection. In: Proceedings of CIKM 2008. (2008) 1053–1062
8. Balog, K., de Rijke, M., Weerkamp, W.: Bloggers as experts: feed distillation using expert retrieval models. In: Proceedings of SIGIR 2008. (2008) 753–754
9. Arguello, J., Elsas, J., Callan, J., Carbonell, J.: Document representation and query expansion models for blog recommendation. In: Proceedings of ICWSM 2008. (2008)
Fig. 1. Effect of the number of posts used for expansion on the performance of TEMPER (MAP vs. number of posts).
Fig. 2. Effect of alpha on the performance of TEMPER compared to Blogger Model (MAP vs. alpha).

10. Nunes, S., Ribeiro, C., David, G.: FEUP at TREC 2008 Blog Track: Using temporal evidence for ranking and feed distillation. In: Proceedings of TREC 2008. (2009)
11. Ernsting, B., Weerkamp, W., de Rijke, M.: Language modeling approaches to blog post and feed finding. In: Proceedings of TREC 2007. (2007)
12. Weerkamp, W., Balog, K., de Rijke, M.: Finding key bloggers, one post at a time. In: Proceedings of ECAI 2008. (2008) 318–322
13. Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for pseudo-relevance feedback. In: Proceedings of SIGIR 2008. (2008) 243–250
14. Lavrenko, V., Croft, W.B.: Relevance based language models. In: Proceedings of SIGIR 2001. (2001) 120–127
15. Salton, G.: The SMART Retrieval System—Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1971)
16. Shokouhi, M., Azzopardi, L., Thomas, P.: Effective query expansion for federated search. In: Proceedings of SIGIR 2009. (2009) 427–434
17. Zhai, C., Lafferty, J.D.: Model-based feedback in the language modeling approach to information retrieval. In: Proceedings of CIKM 2001. (2001) 403–410
18. Lv, Y., Zhai, C.: Positional language models for information retrieval. In: Proceedings of SIGIR 2009. (2009) 299–306
19. Gerani, S., Carman, M.J., Crestani, F.: Proximity-based opinion retrieval. In: Proceedings of SIGIR 2010. (2010) 403–410
20. Keogh, E.J., Pazzani, M.J.: Relevance feedback retrieval of time series data. In: Proceedings of SIGIR 1999. (1999) 183–190
21. Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC-2009 Blog Track. In: Proceedings of TREC 2009. (2009)
22. Macdonald, C., Ounis, I.: The TREC Blogs06 collection: Creating and analysing a blog test collection. Department of Computing Science, University of Glasgow, Tech Report TR-2006-224 (2006)
23. Parapar, J., López-Castro, J., Barreiro, Á.: Blog posts and comments extraction and impact on retrieval effectiveness. In: Proceedings of the Spanish Conference on Information Retrieval 2010. (2010)