Bayesian Approaches to Distribution Regression - arXiv.org
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Bayesian Approaches to Distribution Regression Ho Chung Leon Law∗ Danica J. Sutherland∗ Dino Sejdinovic Seth Flaxman University of Oxford University College London University of Oxford Imperial College London ho.law@spc.ox.ac.uk djs@djsutherland.ml dino.sejdinovic@stats.ox.ac.uk s.flaxman@imperial.ac.uk arXiv:1705.04293v4 [stat.ML] 14 Jan 2021 Abstract graphic groups (Flaxman et al., 2015, 2016), and learning the total mass of dark matter halos from observable galaxy Distribution regression has recently attracted velocities (Ntampaka et al., 2015, 2016). Closely related much interest as a generic solution to the prob- distribution classification problems also include identify- lem of supervised learning where labels are avail- ing the direction of causal relationships from data (Lopez- able at the group level, rather than at the individ- Paz et al., 2015) and classifying text based on bags of word ual level. Current approaches, however, do not vectors (Yoshikawa et al., 2014; Kusner et al., 2015). propagate the uncertainty in observations due to One particularly appealing approach to the distribution re- sampling variability in the groups. This effec- gression problem is to represent the input set of samples tively assumes that small and large groups are by their kernel mean embedding (described in Section 2.1), estimated equally well, and should have equal where distributions are represented as single points in a re- weight in the final regression. We account for producing kernel Hilbert space. Standard kernel methods this uncertainty with a Bayesian distribution re- can then be applied for distribution regression, classifica- gression formalism, improving the robustness tion, anomaly detection, and so on. This approach was per- and performance of the model when group sizes haps first popularized by Muandet et al. (2012); Szábo et al. vary. We frame our models in a neural network (2016) provided a recent learning-theoretic analysis. style, allowing for simple MAP inference using backpropagation to learn the parameters, as well In this framework, however, each distribution is simply rep- as MCMC-based inference which can fully prop- resented by the empirical mean embedding, ignoring the agate uncertainty. We demonstrate our approach fact that large sample sets are much more precisely under- on illustrative toy datasets, as well as on a chal- stood than small ones. Most studies also use point esti- lenging problem of predicting age from images. mates for their regressions, such as kernel ridge regression or support vector machines, thus ignoring uncertainty both in the distribution embeddings and in the regression model. 1 INTRODUCTION Our Contributions We propose a set of Bayesian ap- Distribution regression is the problem of learning a regres- proaches to distribution regression. The simplest method, sion function from samples of a distribution to a single set- similar to that of Flaxman et al. (2015), is to use point level label. For example, we might attempt to infer the estimates of the input embeddings but account for uncer- sentiment of texts based on word-level features, to predict tainty in the regression model with simple Bayesian linear the label of an image based on small patches, or even per- regression. Alternatively, we can treat uncertainty in the in- form traditional parametric statistical inference by learning put embeddings but ignore model uncertainty with the pro- a function from sets of samples to the parameter values. posed Bayesian mean shrinkage model, which builds on a recently proposed Bayesian nonparametric model of uncer- Recent years have seen wide-ranging applications of this tainty in kernel mean embeddings (Flaxman et al., 2016), framework, including inferring summary statistics in Ap- and then use a sparse representation of the desired function proximate Bayesian Computation (Mitrovic et al., 2016), in the RKHS for prediction in the regression model. This estimating Expectation Propagation messages (Jitkrittum model allows for a full account of uncertainty in the mean et al., 2015), predicting the voting behaviour of demo- embedding, but requires a point estimate of the regression function for conjugacy; we thus use backpropagation to ob- tain a MAP estimate for it as well as various hyperparam- Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. eters. We then combine the treatment of the two sources PMLR: Volume 84. Copyright 2018 by the author(s). * These authors contributed equally.
Bayesian Approaches to Distribution Regression of uncertainty into a fully Bayesian model and use Hamil- 2.2 Estimating Mean Embeddings tonian Monte Carlo for efficient inference. Depending on the inferential goals, each model can be useful. We demon- For a set of samples {xj }nj=1 drawn iid from P, the empir- strate our approaches on an illustrative toy problem as well ical estimator of µP , µ cP ∈ Hk , is given by as a challenging real-world age estimation task. Z n 1X cP = µPb = µ k (·, x) P̂(dx) = k(·, xj ). (3) 2 BACKGROUND n j=1 2.1 Problem Overview This is the standard estimator used by previous distribution regression approaches, which the reproducing property of Distribution regression is the task of learning a classifier or Hk shows us corresponds to the kernel a regression function that maps probability distributions to labels. The challenge of distribution regression goes be- 1 X Ni XNj yond the standard supervised learning setting: we do not hd d µP i , µP j iHk = k(xi` , xjr ). (4) Ni Nj have access to exact input-output pairs since the true in- r=1 `=1 puts, probability distributions, are observed only through But (3) is an empirical mean estimator in a high- or infinite- samples from that distribution: dimensional space, and is thus subject to the well-known {x1j }N n Nn j=1 , y1 , . . . , {xj }j=1 , yn , 1 (1) Stein phenomenon, so that its performance is dominated by the James-Stein shrinkage estimators. Indeed, Muandet so that each bag {xij }N j=1 has a label yi along with Ni indi- i et al. (2014) studied shrinkage estimators for mean embed- vidual observations xij ∈ X . We assume that the observa- dings, which can result in substantially improved perfor- Ni tions {xij }j=1 are i.i.d. samples from some unobserved dis- mance for some tasks (Ramdas and Wehbe, 2015). Flax- tribution Pi , and that the true label yi depends only on Pi . man et al. (2016) proposed a Bayesian analogue of shrink- We wish to avoid making any strong parametric assump- age estimators, which we now review. tions on the Pi . For the present work, we will assume the This approach consists of (1) a Gaussian Process prior labels yi are real-valued; Appendix B shows an extension µP ∼ GP(m0 , r(·, ·)) on Hk , where r is selected to en- to binary classification. We typically take the observation sure that µP ∈ Hk almost surely and (2) a normal likeli- space X to be a subset of Rp , but it could easily be a struc- hood µcP (x) | µP (x) ∼ N (µP (x), Σ). Here, conjugacy tured domain such as text or images, since we access it only of the prior and the likelihood leads to a Gaussian process through a kernel (for examples, see e.g. Gärtner, 2008). posterior on the true embedding µP , given that we have ob- We consider the standard approach to distribution regres- served µ cP at some set of locations x. The posterior mean sion, which relies on kernel mean embeddings and ker- is then essentially identical to a particular shrinkage esti- nel ridge regression. For any positive definite kernel func- mator of Muandet et al. (2014), but the method described tion k : X × X → R, there exists a unique reproduc- here has the extra advantage of a closed form uncertainty ing kernel Hilbert space (RKHS) Hk , a possibly infinite- estimate, which we utilise in our distributional approach. dimensional space of functions f : X → R where eval- For the choice of r, we use a Gaussian RBF kernel k, and uation can be written as an inner product, and in particu- choose either R r = k or, following Flaxman et al. (2016), lar f (x) = hf, k(·, x)iHk for all f ∈ Hk , x ∈ X . Here r(x, x0 ) = k(x, z) k(z, x0 ) ν(dz) where ν is proportional k(·, x) ∈ Hk is a function of one argument, y 7→ k(y, x). to a Gaussian measure. For details of our choices, and why they are sufficient for our purposes, see Appendix A. Given a probability measure P on X , let us define the kernel mean embedding into Hk as This model accounts for the uncertainty based on the num- Z ber of samples Ni , shrinking the embeddings for small µP = k (·, x) P(dx) ∈ Hk . (2) sample sizes more. As we will see, this is essential in the context of distribution regression, particularly when bag Notice that µP serves as a high- or infinite-dimensional sizes are imbalanced. vector representation of P. For the kernel mean embed- Rding p of P into Hk to be well-defined, it suffices that 2.3 Standard Approaches to Distribution Regression k(x, x)P(dx) < ∞, which is trivially satisfied for all P if k is bounded. Analogously to the reproducing prop- Following Szábo et al. (2016), assume that the probability erty of R RKHS, µP represents the expectation function on distributions Pi are each drawn randomly from some un- Hk : h(x)P(dx) = hh, µP iHk . For so-called characteris- known meta-distribution over probability distributions, and tic kernels (Sriperumbudur et al., 2010), every probability take a two-stage approach, illustrated as in Figure 1. De- measure has a unique embedding, and thus µP completely noting the feature map k(·, x) ∈ Hk by φ(x), one uses the determines the corresponding probability measure. empirical kernel mean estimator (3) to separately estimate
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman Square Error Loss ∈ Rb Output Layer b̃ + β ⊤µ̂j {µ̂i}bi=1 ∈ Rb×d Mean Pooling φ(Xi) ∈ RNi ×d for i = 1 . . . b Landmarks {uℓ}dℓ=1 [k(Xi, u1), . . . , k(Xi, ud)] Xi ∈ RNi×p for i = 1 . . . b Figure 2: Our baseline model, a RBF network for distri- bution regression. Xi represents the matrix of samples for bag i, while k(Xi , u` ) represents the element wise opera- Figure 1: Each bag is summarised by a kernel mean embed- tion on each row of Xi , with b representing the batch size ding µi ∈ Hk ; a regression function f : Hk → R predicts for stochastic gradient descent. labels yi ∈ R. We propose a Bayesian approach to propa- gate uncertainty due to the number of samples in each bag, obtaining posterior credible intervals illustrated in grey. begin with a non-Bayesian RBF network formulation of the standard approach to distribution regression as a base- the mean of each group: line, before refining this approach to better propagate un- certainty in bag size, as well as model parameters. N1 Nn 1 X 1 X c1 = µ φ(x1j ), . . . , µ cn = φ(xnj ). (5) N1 j=1 Nn i=1 3.1 Baseline Model Next, one uses kernel ridge regression (Saunders et al., 1998) to learn a function f : Hk → R, by minimizing The baseline RBF network formulation we employ here the squared loss with an RKHS complexity penalty: is a variation of the approaches of Broomhead and Lowe X (1988), Que and Belkin (2016), Law et al. (2017), and Za- fˆ = argmin (yi − f (µbi ))2 + λkf k2HK . heer et al. (2017). As shown in Figure 2, the initial input f ∈HK i is a minibatch consisting of several bags Xi , each contain- ing Ni points. Each point is then converted to an explicit Here K : Hk × Hk → R is a “second-level” kernel on featurisation, taking the role of φ in (5), by a radial basis mean embeddings. If K is a linear kernel on the RKHS layer: xij ∈ Rp is mapped to Hk , then the resulting method can be interpreted as a linear (ridge) regression on mean embeddings, which are them- selves nonlinear transformations of the inputs. A nonlin- φ(xij ) = [k(xij , u1 ), . . . , k(xij , ud )]> ∈ Rd ear second-level kernel on Hk sometimes improves perfor- mance (Muandet et al., 2012; Szábo et al., 2016). where u = {u` }d`=1 are landmark points. A mean pool- Distribution regression as described is not scalable for ing layer yields the estimated mean embedding µ̂i corre- even modestly-sized datasets, as computing each of the sponding to each Pof the bags j represented in the minibatch, φ(xij ).1 Finally, a fully connected Ni O(n2 ) entries of the relevant kernel matrix requires time where µ̂i = N1i j=1 O(Ni Nj ). Many applications have thus used variants of output layer gives real-valued labels ŷi = β TP µ̂i + b. As a random Fourier features (Rahimi and Recht, 2007). In this loss function we use the mean square error n1 i (ŷi − yi )2 . paper we instead expand in terms of landmark points drawn For learning, we use backpropagation with the Adam opti- randomly from the observations, yielding radial basis net- mizer (Kingma and Ba, 2015). To regularise the network, works (Broomhead and Lowe, 1988) with mean pooling. we use early stopping on a validation set, as well as an L2 penalty corresponding to a normal prior on β. 3 MODELS 1 In the implementation, we stack all of the bags Xi into a We consider here three different Bayesian models, with single matrix of size j Nj × d for the first layer, then perform P each model encoding different types of uncertainty. We pooling via sparse matrix multiplication.
Bayesian Approaches to Distribution Regression # $ % MAP Objective J(α) = log p(α) bi=1 p(yi|Xi , α) variance, in order to propagate this information regarding ! " α ∼ N 0, ρ2K −1 uncertainty from the bag size through the model. Bayesian tools provide a natural framework for this problem. N (α⊤R!(R + Σi /Ni) −1µ̂i , " α⊤ R − R (R + Σi/Ni ) −1R α + σ 2 ) We can use the Bayesian nonparametric prior over kernel for i = 1 . . . b ! " mean embeddings (Flaxman et al., 2016) described in Sec- Predictive Distribution yi | µi , α ∼ N α⊤ µi (u), σ 2 tion 2.2, and observe the empirical embeddings at the land- ! " N R (R + Σi /Ni) −1µ̂i , R − R (R + Σi /Ni) −1R mark points ui . For ui , we take a fixed set of landmarks, which we can choose via k-means clustering or sample Posterior Of Embedding for i = 1 . . . b without replacement (Que and Belkin, 2016). Using the µi ∼ GP (0, r(·, ·)) {µ̂i}bi=1 ∈ Rb×d conjugacy of the model to the Gaussian process prior µi ∼ GP(m0 , ηr(., .)), we obtain a closed-form posterior Gaus- Mean Pooling sian process whose evaluation at points h = {hs }ns=1 h is: φ(X ) ∈ RNi ×d i −1 µi (h) | xi ∼ N Rh (R + Σi /Ni ) (µ̂i − m0 ) + m0 , for i = 1 . . . b Landmarks u = {uℓ}ℓ=1 [k(Xi, u1), . . . , k(Xi, ud)] d −1 Rhh − Rh (R + Σi /Ni ) Rh> Xi ∈ RNi×p where Rst = ηr(us , ut ), (Rhh )st = ηr(hs , ht ), (Rh )st = for i = 1 . . . b ηr(hs , ut ), and xi denotes the set {xij }N j=1 . We take the i Figure 3: Our Bayesian mean shrinkage pooling model. prior mean m0 to be the average of the µ̂i ; under a lin- This diagram takes m0 = 0, η = 1 and u = z, so that ear kernel K, this means we shrink predictions towards the R = Rz = Rzz , and Kz = K. mean prediction. Note η essentially controls the strength of the shrinkage: a smaller η means we shrink more strongly towards m0 . We take Σi to be the average of the empirical 3.2 Bayesian Linear Regression Model covariance of {ϕ(xij )}N j=1 across all bags, to avoid poor es- i timation of Σi for smaller bags. More intuition about the The most obvious approach to adding uncertainty to the behaviour of this estimator can be found in Appendix C. model of Section 3.1 is to encode uncertainty over regres- sion parameters β only, as follows: Now, supposing we have normal observation error σ 2 , and use a linear kernel as our second level kernel K, we have: β ∼ N (0, ρ2 ) yi | xi , β ∼ N (β T µ̂i , σ 2 ). yi | µi , f ∼ N hf, µi iHk , σ 2 (6) This is essentially Bayesian linear regression on the empiri- where f ∈ Hk . Clearly, this is P difficult to work with; cal mean embeddings, and is closely related to the model of s hence we parameterise f as f = `=1 α` k(·, z` ), where Flaxman et al. (2015). Here, we are working directly with z = {z` }s`=1 is a set of landmark points for f , which we the finite-dimensional µ̂i , unlike the infinite-dimensional can learn or fix. (Appendix D gives a motivation for this µi before. Due to the conjugacy of the model, we can eas- approximation using the representer theorem.) Using the ily obtain the predictive distribution yi | xi , integrating out reproducing property, our likelihood model becomes: the uncertainty over β. This provides us with uncertainty intervals for the predictions yi . yi | µi , α ∼ N αT µi (z), σ 2 (7) For model tuning, we can maximise the model evidence, where µi (z) = [µi (z1 ), . . . , µi (zs )]> . For fixed α and z i.e. the marginal log-likelihood (see Bishop (2006) for de- we can analytically integrate out the dependence on µi , and tails), and use backpropagation through the network to the predictive distribution of a bag label becomes learn σ and ρ and any kernel parameters of interest.2 yi | xi , α ∼ N (ξiα , νiα ) α > Σi −1 3.3 Bayesian Mean Shrinkage Model ξi = α Rz R + (µ̂i − m0 ) + αT m0 Ni −1 ! A shortcoming of the prior models, and of the standard ap- Σi α T proach in Szábo et al. (2016), is that they ignore uncertainty νi = α Rzz − Rz R + Rz α + σ 2 . T Ni in the first level of estimation due to varying number of samples in each bag. Ideally we would estimate not just the The prior α ∼ N (0, ρ2 Kz−1 ), where Kz is the kernel ma- mean embedding per bag, but also a measure of the sample trix on z, gives the standard regularisation on f of kf k2Hk . 2 Note that unlike the other models considered in this paper, The log-likelihood objective becomes we cannot easily do minibatch stochastic gradient descent, as the n ( ) 1X 2 marginal log-likelihood does not decompose for each individual α (yi − ξiα ) αT Kz α log νi + + . data point. 2 i=1 ξiα 2ρ2
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman We can use backpropagation to learn the parameters α, σ, There have also been some Bayesian approaches in related and if we wish η, z, and any kernel parameters. The full contexts, though most do not follow our setting where the model is illustrated in Figure 3. This approach allows us to label is a function of the underlying distribution rather than directly encode uncertainty based on bag size in the objec- the observed sample set. Kück and de Freitas (2005) con- tive function, and gives probabilistic predictions. sider an MCMC method with group-level labels but focus on individual-level classifiers, while Jackson et al. (2006) 3.4 Bayesian Distribution Regression use hierarchical Bayesian models on both individual-level and aggregate data for ecological inference. It is natural to combine the two Bayesian models above, Jitkrittum et al. (2015) and Flaxman et al. (2015) quantify fully propagating uncertainty in estimation of the mean the uncertainty of distribution regression models by inter- embedding and of the regression coefficients α. Unfortu- preting the kernel ridge regression on embeddings as Gaus- nately, conjugate Bayesian inference is no longer available. sian process regression. However, the former’s setting has Thus, we consider a Markov Chain Monte Carlo (MCMC) no uncertainty in the mean embeddings, while the latter’s sampling based approach, and here use Hamiltonian Monte treats empirical embeddings as fixed inputs to the learning Carlo (HMC) for efficient inference, though any MCMC- problem (as in Section 3.2). type scheme would work. Whereas inference above used gradient descent to maximise the marginal likelihood, with There has also been generic work on input uncertainty the gradient calculated using automatic differentiation, here in Gaussian process regression (Girard, 2004; Damianou we use automatic differentiation to calculate the gradient of et al., 2016). These methods could provide a framework the joint log-likelihood and follow this gradient as we per- towards allowing for second-level kernels in our models. form sampling over the parameters we wish to infer. One could also, though, consider regression with uncertain inputs as a special case of distribution regression, where the We can still exploit the conjugacy of the mean shrinkage label is a function of the distribution’s mean and Ni = 1. layer, obtaining an analytic posterior over the mean em- beddings. Conditional on the mean embeddings, we have a Bayesian linear regression model with parameters α. We 5 EXPERIMENTS sample this model with the NUTS HMC sampler (Hoffman and Gelman, 2014; Stan Development Team, 2014). We will now demonstrate our various Bayesian approaches: the mean-shrinkage pooling R method with r = k (shrink- 4 RELATED WORK age) and with r(x, x0 ) = k(x, z)k(z, x0 )ν(dz) for ν pro- portional to a Gaussian measure (shrinkageC), Bayesian As previously mentioned, Szábo et al. (2016) provides linear regression (BLR), and the full Bayesian distribu- a thorough learning-theoretic analysis of the regression tion regression model with r = k (BDR). We also com- model discussed in Section 2.3. This formalism consid- pare the non-Bayesian baselines RBF network (Section 3.1) ering a kernel method on distributions using their embed- and freq-shrinkage, which uses the shrinkage estimator of ding representations, or various scalable approximations Muandet et al. (2014) to estimate mean embeddings. Code to it, has been widely applied (e.g. Muandet et al., 2012; for our methods and to reproduce the experiments is avail- Yoshikawa et al., 2014; Flaxman et al., 2015; Jitkrittum able at https://github.com/hcllaw/bdr. et al., 2015; Lopez-Paz et al., 2015; Mitrovic et al., 2016). There are also several other notions of similarities on distri- We first demonstrate the characteristics of our models on a butions in use (not necessarily falling within the framework synthetic dataset, and then evaluate them on a real life age of kernel methods and RKHSs), as well as local smoothing prediction problem. Throughout, for simplicity, we take approaches, mostly based on estimates of various probabil- u = z, i.e. R = Rz = Rzz , and Kz = K – although ity metrics (Moreno et al., 2003; Jebara et al., 2004; Póczos u and z could be different, with z learnt. Here k is the et al., 2011; Oliva et al., 2013; Poczos et al., 2013; Kusner standard RBF kernel. We tune the learning rate, number et al., 2015). For a partial overview, see Sutherland (2016). of landmarks, bandwidth of the kernel and regularisation parameters on a validation set. For BDR, we use weakly Other related problems of learning on instances with informative normal priors (possibly truncated at zero); for group-level labels include learning with label proportions other models, we learn the remaining parameters. (Quadrianto et al., 2009; Patrini et al., 2014), ecological inference (King, 1997; Gelman et al., 2001), pointillistic 5.1 Gamma Synthetic Data pattern search (Ma et al., 2015), multiple instance learning (Dietterich et al., 1997; Kück and de Freitas, 2005; Zhou We create a synthetic dataset by repeatedly sampling from et al., 2009; Krummenacher et al., 2013) and learning with the following hierarchical model, where yi is the label for sets (Zaheer et al., 2017).3 the ith bag, each xij ∈ R5 has entries i.i.d. according to 3 For more, also see giorgiopatrini.org/nips15workshop. the given distribution, and ε is an added noise term which
Bayesian Approaches to Distribution Regression 1.4 ated dataset, 25% of the bags have Ni = 20, and 25% have 1.2 Ni = 100. Among the other half of the data, we vary the ratio of Ni = 5 and Ni = 1 000 bags to demonstrate the predictive mean NLL 1.0 methods’ efficacy at dealing with varied bag sizes: we let 0.8 s5 be the overall percentage of bags with Ni = 5, ranging 0.6 from s5 = 0 (in which case no bags have size Ni = 5) to uniform 0.4 BLR s5 = 50 (in which case 50% of the overall bags have size 0.2 shrinkageC Ni = 5). Here we do not add additional noise: ε = 0. shrinkage 0.0 BDR Results are shown in Figure 4. BDR and shrinkage meth- optimal ods, which take into account bag size uncertainty, per- 0% 10% 20% 30% 40% 50% form well here compared to the other methods. The full proportion with n = 5 BDR model very slightly outperforms the Bayesian shrink- 0.8 age models in both likelihood and in mean-squared error; 0.7 frequentist shrinkage slightly outperforms the Bayesian 0.6 shrinkage models in MSE, likely because it is tuned for that metric. We also see that the choice of r affects the results; 0.5 MSE RBF network r = k does somewhat better. BLR 0.4 shrinkageC Figure 5 demonstrates in more detail the difference be- 0.3 freq-shrinkage tween these models. It shows test set predictions of each shrinkage 0.2 BDR model on the bags of different sizes. Here, we can see optimal explicitly that the shrinkage and BDR models are able to 0% 10% 20% 30% 40% 50% take into account the bag size, with decreasing variance proportion with n = 5 for larger bag sizes, while the BLR model gives the same variance for all outputs. Furthermore, the shrinkage and Figure 4: Top: negative log-likelihood. Bottom: mean- BDR models can shrink their predictions towards the mean squared error. For context, performance of the Bayes- more for smaller bags than larger ones: this improves per- optimal predictor is also shown, and for NLL ‘uniform’ formance on the small bags while still allowing for good shows the performance of a uniform prediction on the pos- predictions on large bags, contrary to the BLR model. sible labels. For MSE, the constant overall mean label pre- dictor achieves about 1.3. Fixed bag size: Uncertainty in the regression model. The previous experiment showed the efficacy of the shrink- differs for the two experiments below: age estimator in our models, but demonstrated little gain from posterior inference for regression weights β over their yi ∼ Uniform(4, 8) MAP estimates, i.e. there is no discernible improvement of BLR over RBF network. To isolate the effect of quantify- i iid 1 yi 1 xj ` | yi ∼ Γ , + ε for j ∈ [Ni ], ` ∈ [5]. ing uncertainty in the regression model, we now consider yi 2 2 the case where there is no variation in bag size at all and normal noise is added onto the observations. In particular In these experiments, we generate 1 000 bags for training, we take Ni = 1000 and ε ∼ N (0, 1), and sample land- 500 bags for a validation set for parameter tuning, 500 marks randomly from the training set. bags to use for early-stopping of the models, and 1 000 bags for testing. Tuning is performed to maximize log- Results are shown in Table 1. Here, BLR or BDR outper- likelihoods for Bayesian models, MSE for non-Bayesian form all other methods on all runs, highlighting that uncer- models. Landmark points u are chosen via k-means (fixed tainty in the regression model is also important for predic- across all models). We also show results of the Bayes- tive performance. Importantly, the BDR method performs optimal model, which gives true posteriors according to well in this regime as well as in the previous one. the data-generating process; this is the best performance any model could hope to achieve. Our learning models, 5.2 IMDb-WIKI: Age Estimation which treat the inputs as five-dimensional, fully nonpara- metric distributions, are at a substantial disadvantage even We now demonstrate our methods on a celebrity age es- in how they view the data compared to this true model. timation problem, using the IMDb-WIKI database (Rothe et al., 2016) which consists of 397 949 images of 19 545 Varying bag size: Uncertainty in the inputs. In order celebrities4 , with corresponding age labels. This database to study the behaviour of our models with varying bag size, 4 We used only the IMDb images, and removed some implau- we fix four sizes Ni ∈ {5, 20, 100, 1 000}. For each gener- sible images, including one of a cat and several of people with
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman BLR shrinkage BDR optimal 9 8 7 Ni = all 6 5 4 NLL = 1.162 NLL = 0.738 NLL = 0.708 NLL = 0.372 3 MSE = 0.613 MSE = 0.465 MSE = 0.430 MSE = 0.350 9 8 7 Ni = 5 6 5 4 NLL = 1.767 NLL = 1.417 NLL = 1.403 NLL = 1.183 3 MSE = 1.337 MSE = 0.994 MSE = 0.925 MSE = 0.814 9 8 7 Ni = 20 6 5 4 NLL = 1.161 NLL = 1.214 NLL = 1.184 NLL = 0.875 3 MSE = 0.597 MSE = 0.650 MSE = 0.596 MSE = 0.464 9 8 7 Ni = 100 6 5 4 NLL = 0.867 NLL = 0.572 NLL = 0.536 NLL = 0.216 3 MSE = 0.265 MSE = 0.182 MSE = 0.168 MSE = 0.110 9 8 7 Ni = 1000 6 5 4 NLL = 0.855 NLL = -0.251 NLL = -0.293 NLL = -0.786 3 MSE = 0.253 MSE = 0.034 MSE = 0.031 MSE = 0.014 4 5 6 7 84 5 6 7 84 5 6 7 84 5 6 7 8 0.0 0.2 0.4 0.6 0.8 1.0 predictive std dev Figure 5: Predictions for the varying bag size experiment of Section 5.1. Each column corresponds to a single prediction method. Each point in an image represents a single bag, with its horizontal position the true label yi , and its vertical position the predicted label. The black lines show theoretical perfect predictions. The rows represent different subsets of the data: the first row shows all bags, the second only bags with Ni = 5, and so on. Colours represent the predictive standard deviation of each point.
Bayesian Approaches to Distribution Regression Table 1: Results on the fixed bag size dataset, over rameter tuning, we relegate this to a later study. 10 dataset draws (standard deviations in parentheses). BLR/BDR perform best on all runs in both metrics. We use 9 820 bags for training, 2 948 bags for early stop- ping, 2 946 for validation and 3 928 for testing. Landmarks are sampled without replacement from the training set. METHOD MSE NLL We repeat the experiment on 10 different splits of the data, Optimal 0.170 (0.009) 0.401 (0.018) and report the results in Table 2. The baseline CNN results RBF network 0.235 (0.014) – give performance by averaging the predictive distribution freq-shrinkage 0.232 (0.012) – from the model of Rothe et al. (2016) for each image of a shrinkage 0.237 (0.014) 0.703 (0.027) bag; note that this model was trained on all of the images shrinkageC 0.236 (0.013) 0.700 (0.029) used here. From Table 2, we can see that the shrinkage BLR 0.228 (0.012) 0.681 (0.025) methods have the best performance; they outperforms all BDR 0.227 (0.012) 0.683 (0.025) other methods in all 10 splits of the dataset, in both met- rics. Non-Bayesian shrinkage again yields slightly better RMSEs, likely because it is tuned for that metric. This Table 2: Results on the grouped IMDb-WIKI dataset over demonstrates that modelling bag size uncertainty is vital. ten runs (standard deviations in parentheses). Here shrink- age methods perform the best across all 10 runs. 6 CONCLUSION METHOD RMSE NLL Supervised learning on groups of observations using ker- nel mean embeddings typically disregards sampling vari- CNN 10.25 (0.22) 3.80 (0.034) ability within groups. To handle this problem, we con- RBF network 9.51 (0.20) – struct Bayesian approaches to modelling kernel mean em- freq-shrinkage 9.22 (0.19) – beddings within a regression model, and investigate advan- shrinkage 9.28 (0.20) 3.54 (0.021) tages of uncertainty propagation within different compo- BLR 9.55 (0.19) 3.68 (0.021) nents of the resulting distribution regression. The ability to take into account the uncertainty in mean embedding es- timates is demonstrated to be key for constructing mod- was constructed by crawling IMDb for images of its most els with good predictive performance when group sizes are popular actors and directors, with potentially many images highly imbalanced. We also demonstrate that the results of for each celebrity over time. Rothe et al. (2016) use a con- a complex neural network model for age estimation can be volutional neural network (CNN) with a VGG-16 architec- improved by shrinkage. ture to perform 101-way classification, with one class cor- responding to each age in {0, . . . , 100}. Our models employ a neural network formulation to pro- vide more expressive feature representations and learn dis- We take a different approach, and assume that we are given criminative embeddings. Doing so makes our model easy several images of a single individual (i.e. samples from to extend to more complicated featurisations than the sim- the distribution of celebrity images), and are asked to pre- ple RBF network used here. By training with backpropa- dict their mean age based on several pictures. For example, gation, or via approximate Bayesian methods such as vari- we have 757 images of Brad Pitt from age 27 up to 51, ational inference, we can easily ‘learn the kernel’ within while we have only 13 images of Chelsea Peretti at ages 35 our framework, for example fine-tuning the deep network and 37. Note that 22.5% of bags have only a single image. of Section 5.2 rather than using a pre-trained model. We We obtain 19 545 bags, with each bag containing between can also apply our networks to structured settings, learning 1 and 796 images of a particular celebrity, and the corre- regression functions on sets of images, audio, or text. Such sponding bag label calculated from the average of the age models naturally fit into the empirical Bayes framework. labels of the images inside each bag. On the other hand, we might extend our model to more In particular, we use the representation ϕ(x) learnt by the Bayesian feature learning by placing priors over the kernel CNN in Rothe et al. (2016), where ϕ(x) : R256×256 → hyperparameters, building on classic work on variational R4096 maps from the pixel space of images to the CNN’s approaches (Barber and Schottky, 1998) and fully Bayesian last hidden layer. With these new representations, we can inference (Andrieu et al., 2001) in RBF networks. Such now treat them as inputs to our radial basis network, shrink- approaches are also possible using other featurisations, e.g. age (taking r = k here) and BLR models. Although we random Fourier features (as in Oliva et al., 2015). could also use the full BDR model here, due to the compu- tational time and memory required to perform proper pa- Future distribution regression approaches will need to ac- count for uncertainty in observation of the distribution. Our supposedly negative age, or ages of several hundred years. methods provide a strong, generic building block to do so.
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman References Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan, Dino Se- Christophe Andrieu, Nando De Freitas, and Arnaud jdinovic, and Zoltán Szabó. Kernel-Based Just-In-Time Doucet. Robust full bayesian learning for radial ba- Learning for Passing Expectation Propagation Mes- sis networks. Neural Computation, 13(10):2359–2407, sages. In UAI, 2015. 2001. Gary King. A Solution to the Ecological Inference Problem. David Barber and Bernhard Schottky. Radial basis func- Princeton University Press, 1997. ISBN 0691012407. tions: a bayesian treatment. NIPS, pages 402–408, 1998. Christopher M. Bishop. Pattern recognition and machine Diederik Kingma and Jimmy Ba. Adam: A method learning. Springer New York, 2006. for stochastic optimization. In ICLR, 2015. arXiv:1412.6980. David S Broomhead and David Lowe. Radial basis func- tions, multi-variable functional interpolation and adap- Gabriel Krummenacher, Cheng Soon Ong, and Joachim M tive networks. Technical report, DTIC Document, 1988. Buhmann. Ellipsoidal multiple instance learning. In ICML (2), pages 73–81, 2013. Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and Hendrik Kück and Nando de Freitas. Learning about in- uncertain inputs in Gaussian processes. JMLR, 17(42): dividuals from group statistics. In UAI, pages 332–339, 1–62, 2016. 2005. Thomas G. Dietterich, Richard H. Lathrop, and Tomás Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Wein- Lozano-Pérez. Solving the multiple instance problem berger. From word embeddings to document distances. with axis-parallel rectangles. Artificial intelligence, 89 In ICML, pages 957–966, 2015. (1):31–71, 1997. Ho Chung Leon Law, Christopher Yau, and Dino Se- Seth Flaxman, Yu-Xiang Wang, and Alexander J Smola. jdinovic. Testing and learning on distributions Who supported Obama in 2012?: Ecological inference with symmetric noise invariance. In NIPS, 2017. through distribution regression. In KDD, pages 289–298. arXiv:1703.07596. ACM, 2015. David Lopez-Paz, Krikamol Muandet, Bernhard Seth Flaxman, Dino Sejdinovic, John P. Cunningham, and Schölkopf, and Ilya Tolstikhin. Towards a learn- Sarah Filippi. Bayesian learning of kernel embeddings. ing theory of cause-effect inference. In ICML, 2015. In UAI, 2016. Milan Lukić and Jay Beder. Stochastic processes with sam- Seth Flaxman, Danica J. Sutherland, Yu-Xiang Wang, ple paths in reproducing kernel hilbert spaces. Trans- and Yee-Whye Teh. Understanding the 2016 US pres- actions of the American Mathematical Society, 353(10): idential election using ecological inference and dis- 3945–3969, 2001. tribution regression with census microdata. 2016. arXiv:1611.03787. Yifei Ma, Danica J. Sutherland, Roman Garnett, and Jeff Schneider. Active pointillistic pattern search. In AIS- Thomas Gärtner. Kernels for Structured Data, volume 72. TATS, 2015. World Scientific, Series in Machine Perception and Arti- ficial Intelligence, 2008. Jovana Mitrovic, Dino Sejdinovic, and Yee-Whye Teh. DR-ABC: Approximate Bayesian Computation with Andrew Gelman, David K Park, Stephen Ansolabehere, Kernel-Based Distribution Regression. In ICML, pages Phillip N. Price, and Lorraine C. Minnite. Models, as- 1482–1491, 2016. sumptions and model checking in ecological regressions. Journal of the Royal Statistical Society: Series A (Statis- Pedro J Moreno, Purdy P Ho, and Nuno Vasconcelos. A tics in Society), 164(1):101–118, 2001. Kullback-Leibler divergence based kernel for SVM clas- Agathe Girard. Approximate methods for propagation of sification in multimedia applications. In NIPS, 2003. uncertainty with Gaussian process models. PhD thesis, Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, University of Glasgow, 2004. and Bernhard Schölkopf. Learning from distribu- Matthew D. Hoffman and Andrew Gelman. The no-U-turn tions via support measure machines. In NIPS, 2012. sampler: Adaptively setting path lengths in Hamiltonian arXiv:1202.6504. Monte Carlo. JMLR, pages 1593–1623, 2014. Krikamol Muandet, Kenji Fukumizu, Bharath Sriperum- Christopher Jackson, Nicky Best, and Sylvia Richardson. budur, Arthur Gretton, and Bernhard Schoelkopf. Kernel Improving ecological inference using individual-level mean estimation and stein effect. In ICML, 2014. data. Statistics in medicine, 25(12):2136–2159, 2006. Michelle Ntampaka, Hy Trac, Danica J. Sutherland, Tony Jebara, Risi Imre Kondor, and Andrew Howard. Prob- Nicholas Battaglia, Barnabás Póczos, and Jeff Schnei- ability product kernels. JMLR, 5:819–844, 2004. der. A machine learning approach for dynamical
Bayesian Approaches to Distribution Regression mass measurements of galaxy clusters. The Astro- Stan Development Team. Stan: A C++ library for proba- physical Journal, 803(2):50, 2015. ISSN 1538-4357. bility and sampling, version 2.5.0, 2014. URL http: arXiv:1410.0686. //mc-stan.org/. Michelle Ntampaka, Hy Trac, Danica J. Sutherland, Sebas- Ingo Steinwart. Convergence types and rates in generic tian Fromenteau, Barnabás Póczos, and Jeff Schneider. Karhunen-Loéve expansions with applications to sam- Dynamical mass measurements of contaminated galaxy ple path properties. arXiv preprint arXiv:1403.1040v3, clusters using machine learning. The Astrophysical Jour- March 2017. nal, 831(2):135, 2016. arXiv:1509.05409. Danica J. Sutherland. Scalable, Flexible, and Active Learn- Junier B Oliva, Barnabás Póczos, and Jeff Schneider. Dis- ing on Distributions. PhD thesis, Carnegie Mellon Uni- tribution to distribution regression. In ICML, 2013. versity, 2016. Junier B Oliva, Avinava Dubey, Barnabás Póczos, Jeff Zoltán Szábo, Bharath K. Sriperumbudur, Barnabás Schneider, and Eric P Xing. Bayesian nonparametric Póczos, and Arthur Gretton. Leraning theory for kernel-learning. In AISTATS, 2015. arXiv:1506.08776. distribution regression. JMLR, 17(152):1–40, 2016. arXiv:1411.2066. Giorgio Patrini, Richard Nock, Tiberio Caetano, and Paul Rivera. (Almost) no label no cry. In NIPS. 2014. Grace Wahba. Spline models for observational data, vol- ume 59. Siam, 1990. Natesh S Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, and Robert L Wolpert. Characterizing the function space Yuya Yoshikawa, Tomoharu Iwata, and Hiroshi Sawada. for bayesian kernel models. JMLR, 8(Aug):1769–1797, Latent support measure machines for bag-of-words data 2007. classification. In NIPS, pages 1961–1969, 2014. Barnabás Póczos, Liang Xiong, and Jeff Schneider. Non- Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, parametric divergence estimation with applications to Barnabás Póczos, Ruslan Salakhutdinov, and Alexander machine learning on distributions. In UAI, 2011. Smola. Deep sets. In NIPS, 2017. Barnabás Póczos, Aarti Singh, Alessandro Rinaldo, Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. Multi- and Larry Wasserman. Distribution-free distribu- instance learning by treating instances as non-iid sam- tion regression. In AISTATS, pages 507–515, 2013. ples. In ICML, 2009. arXiv:1302.0082. Novi Quadrianto, Alex J Smola, Tiberio S Caetano, and Quoc V Le. Estimating labels from label proportions. JMLR, 10:2349–2374, 2009. Qichao Que and Mikhail Belkin. Back to the future: Radial basis function networks revisited. In AISTATS, 2016. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007. Aaditya Ramdas and Leila Wehbe. Nonparametric inde- pendence testing for small sample sizes. In IJCAI, 2015. arXiv:1406.1922. Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single im- age without facial landmarks. International Journal of Computer Vision (IJCV), July 2016. Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual vari- ables. In ICML, 1998. Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT, 2001. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fuku- mizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 99:1517–1561, 2010.
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman A Choice of r(·, ·) to ensure µP ∈ Hk We need to choose an appropriate covariance function r, such that µP ∈ Hk , where µP ∼ GP(0, r(·, ·)). In particular, it is for infinite-dimensional RKHSs not sufficient to define r(·, ·) = k(·, ·), as draws from this particular prior are no longer in Hk (Wahba, 1990) (but see below). However, we can construct Z r(x, y) = k(x, z)k(z, y)ν(dz) (8) where ν is any finite measure on X . This then ensures µP ∈ Hk with probability 1 by the nuclear dominance (Lukić and Beder, 2001; Pillai et al., 2007) for any stationary kernel k. In particular, Flaxman et al. (2016) provides details when k is a squared exponential kernel defined by 1 k(x, y) = exp(− (x − y)> Σ−1 k (x − y)) x, y ∈ Rp 2 ||z||2 and ν(dz) = exp − 2`22 dz, i.e. it is proportional to a Gaussian measure on Rd , which provides r(·, ·) with a non- stationary component. In this paper, we take Σk = σ 2 Ip , where σ 2 and ` are tuning parameters, or parameters that we learn. Here, the above holds for a general set of stationary kernels, but note that by taking a convolution of a kernel with itself, it might make the space of functions that we consider overly smooth (i.e. concentrated on a small part of Hk ). In this work, however, we consider only the Gaussian RBF kernel k. In fact, recent work (Steinwart, 2017, Theorem 4.2) actually shows that in this case, the sample paths almost surely belong to (interpolation) spaces which are infinitesimally larger than the RKHS of the Gaussian RBF kernel. This suggests that we can choose r to be an RBF kernel with a length scale that is infinitesimally bigger than that of k; thus, in practice, taking r = k would suffice and we do observe that it actually performs better (Fig. 4). B Framework for Binary Classification Suppose that our labels yi ∈ {0, 1}, i.e. we are in a binary classification framework. Then a simple approach to accounting for uncertainty in the regression parameters is to use bayesian logistic regression, putting priors on β, i.e. β ∼ N (0, ρ2 ) πi yi ∼ Ber(πi ), where log = β > µ̂i 1 − πi however for the mean shrinkage pooling model, if we use the above yi | µi , α, we would not be able to obtain an analytical solution for p(yi |xi , α). Instead we use the probit link function, as given by: P r(yi = 1|µi , α) = Φ α> µi (z) where Φ denotes the Cumulative Distribution Function (CDF) of a standard normal distribution, with µi (z) = [µi (z1 ), . . . , µi (zs )]> . Then as before we have µi (z) | xi ∼ N (Mi , Ci ) with Mi and Ci as defined in section 3.3. Hence, as before Z P r(yi = 1|xi , α) = P r(yi = 1|µi , α)p(µi (z)|xi )dµi (z) Z 1 = c Φ(α> µi (z)) exp{− (µi (z) − Mi )> Ci−1 (µi (z) − Mi )}dµi (z) 2 Z 1 (with li = µi (z) − Mi ) = c Φ(α> (li + Mi )) exp{− (li )> Ci−1 (li )}dli 2 = P r(Y ≤ α> (li + Mi ))
Bayesian Approaches to Distribution Regression Note here Y ∼ N (0, 1) and li ∼ N (0, Σi ) Then expanding and rearranging P r(yi = 1|xi , α) = P r(Y − α> li ≤ α> Mi ) Note that since Y and li independent normal r.v., Y − α> li ∼ N (0, 1 + α> Ci α> ). Let T be standard normal, then we have: p P r(yi = 1|xi , α) = P r( 1 + α> Ci α T ≤ α> Mi ) α > Mi = P r(T ≤ p ) 1 + α> Ci α ! α > Mi = Φ p 1 + α> Ci α Hence, we also have: ! α > Mi P r(yi = 0|xi , α) = 1−Φ p 1 + α > Ci α Now placing the prior α ∼ N (0, ρ2 Kz−1 ), we have the following MAP objective: " n # Y J(α) = log p(α) p(yi |xi , α) i=1 n ! X α > Mi = (1 − yi ) log(1 − Φ p ) i=1 1 + α > Ci α ! α > Mi 1 +yi log(Φ p ) + 2 α> Kz α > 1 + α Ci α ρ Since we have an analytical solution for P r(yi = 0|xi , α), we can also use this in HMC for BDR. C Some more intuition on the shrinkage estimator In this section, we provide some intuition behind the shrinkage estimator in section 3.3. Here, for simplicity, we choose Σi = τ 2 I for all bag i, and m0 = 0, and consider the case where z = u, i.e. R = Rz = Rzz . We can then see that if R has eigendecomposition U ΛU T , with Λ = diag(λk ), the posterior mean is λk U diag U T (µ̂i ), λk + τ 2 /Ni so that large eigenvalues, λk τ 2 /Ni , are essentially unchanged, while small eigenvalues, λk τ 2 /Ni , are shrunk towards 0. Likewise, the posterior variance is ! ! λ2k 1 U diag λk − τ2 U T = U diag Ni 1 UT ; λk + N τ 2 + λ i k its eigenvalues also decrease as Ni /τ 2 increases. D Alternative Motivation for choice of f Pk Here we provide an alternative motivation for the choice of f = s=1 αs k(·, zs ). First, consider the following Bayesian model with a linear kernel K on µi , where f : Hk → R: yi | µi , f ∼ N f (µi ), σ 2 .
Ho Chung Leon Law∗ , Danica J. Sutherland∗ , Dino Sejdinovic, Seth Flaxman Now considering the log-likelihood of {µ, Y } = {µi , yi }ni=1 (supposing we have these exact embeddings), we obtain: n X 1 log p(Y |µ, f ) = − (yi − f (µi ))2 i=1 2σ 2 To avoid over-fitting, we place a Gaussian prior on f , i.e. − log p(f ) = λ||f ||Hk + c. Minimizing the negative log- likelihood over f ∈ Hk , we have: Xn ∗ 1 f = argminf ∈Hk (y − f (µi ))2 + λ||f ||Hk 2 i i=1 2σ Now this is in the form of an empirical risk minimisation problem. Hence using the representer theorem (Schölkopf et al., 2001), we have that: n X f= γj K(., µj ) j=1 i.e. we have a finite-dimensional problem to solve. Thus since K is a linear kernel: Xn yi | µi , {µj }nj=1 , γ ∼ N γj hµi , µj iHk , σ 2 . j=1 where hµi , µj iHk can be thought of as the similarity between distributions. Now we have the same GP posterior as in Section 3.3, and we would like to compute p(yi |xi , γ). This suggests we need to integrate out µ1 , . . . µn . But it is unclear how to perform this integration, since the µi follow Gaussian process distributions. Pk Hence we can take an approximation to f , i.e. f = s=1 αs k(·, zs ), which would essentially give us a dual method with a sparse approximation to f .
You can also read