Output-Constrained Bayesian Neural Networks - arXiv
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Output-Constrained Bayesian Neural Networks Wanqian Yang * 1 Lars Lorch * 1 Moritz A. Graule * 1 Srivatsan Srinivasan 1 Anirudh Suresh 1 Jiayu Yao 1 Melanie F. Pradier 1 Finale Doshi-Velez 1 Abstract two domains where the ability to encode such con- Bayesian neural network (BNN) priors are straints is crucial: (i) prediction of clinical actions in arXiv:1905.06287v1 [cs.LG] 15 May 2019 defined in parameter space, making it hard health care, where constraints prevent unsafe actions to encode prior knowledge expressed in for certain physiological inputs, and (ii) human mo- function space. We formulate a prior that tion prediction, where joint positions are constrained incorporates functional constraints about by anatomically feasible ranges. what the output can or cannot be in regions Our contributions are: (a) we introduce constraint of the input space. Output-Constrained priors, capable of incorporating both negative con- BNNs (OC-BNN) represent an interpretable straints (where the function cannot be) and positive approach of enforcing a range of constraints, constraints (where the function should be), applica- fully consistent with the Bayesian frame- ble with any black-box inference algorithm normally work and amenable to black-box inference. used with BNNs, and (b) we demonstrate the applica- We demonstrate how OC-BNNs improve tion of constraint priors with a variety of suitable in- model robustness and prevent the predic- ference methods on toy problems as well as two large tion of infeasible outputs in two real-world and high-dimensional real-world data sets. applications of healthcare and robotics. 2. Related Work 1. Introduction Most closely related to our work, (Lorenzi & Filip- BNNs combine powerful function approximators pone, 2018) considered function-space equality and with the ability to model uncertainty, making them inequality constraints of deep probabilistic models. useful in domains where (i) training data is expen- However, they focused on deep Gaussian processes sive or limited, or (ii) inaccurate predictions are pro- (DGPs) rather than BNNs, and on low-dimensional hibitively costly and decision-making must be in- data from simulated ODE systems, whereas we con- formed by our level of confidence (MacKay, 1995; sider high-dimensional real-world settings. They also Neal, 1995). Domain experts often have prior knowl- do not consider classification settings. edge about the modeled function and the ability to encode such information on top of training data can (Hafner et al., 2018) specify a Gaussian function thus improve performance. However, BNNs define prior with the goal of preventing overconfident BNN prior distributions over parameters, whose high di- predictions out-of-distribution. In contrast, we use mensionality and lack of interpretability make the in- ”positive constraints” to guide the function where it corporation of functional beliefs close to impossible. should be. Also related are functional BNNs by (Sun et al., 2019), where variational inference is performed We present an interpretable approach for incorporat- in function-space using a stochastic process model. ing prior functional information into BNNs in the Their view is more general—and accordingly, more form of constraints, while staying consistent with the complex to optimize—while we focus on constraints Bayesian framework. We then apply our method to in specific regions of the input-output space. * Equal contribution 1 Harvard University. Correspon- dence to: Wanqian Yang , 3. Background Lars Lorch . A conventional BNN, operating in the function (or Presented at the ICML 2019 Workshop on Uncertainty and input-output) space X × Y , typically has a prior over Robustness in Deep Learning and Workshop on Under- parameters p(W ), where W are the neural network standing and Improving Generalization in Deep Learning. Long Beach, CA, 2019. Copyright 2019 by the author(s). weights and biases. Given data D = {xn , yn }nN=1 , we
Output-Constrained Bayesian Neural Networks perform inference to obtain the posterior p(W | D) ∝ knowledge in weight space while retaining the p(W ) p(D | W ). The posterior predictive for the out- weight-space prior p(W ). Intuitively, g(W | C) mea- put y0 for some new input x0 is obtained by integrat- sures the BNN’s adherence to the constrained region. ing over the posterior distribution of W : It remains to describe how g is defined. For positive constraints C + , g measures how close φ(C + Z 0 0 p(y | x , D) = 0 0 p(y | x , W ) p(W | D)d W (1) x ; W ) lies W to C y , for which natural choices of distributions ex- ist for both regression and classification. For nega- The space of W is high-dimensional and the relation- tive constraints C − , we define g as the expected viola- ship between the weights and the function is non- tion of C − − y given φ (C x ; W ) using a classifier function. intuitive. As such, the prior p(W ) is often trivially Complete definitions of g for positive and negative chosen as an isotropic Gaussian: priors are provided in Appendix A; details on infer- p(W ) = ∏ N (W i ; 0, σp2 ) (2) ence procedures are provided in Appendix B. i 5. Demonstrations on Synthetic Data 4. Output-Constrained BNNs This section provides proof of concepts of OC-BNNs We consider two kinds of “expert knowledge”: posi- using 2-dimensional synthetic examples. Refer to Ap- tive constraints define regions where a function should pendix C for experimental details and Appendix D be, and negative constraints define regions where a for additional results. For regression, the posteriors function cannot be. This delineation is not arbitrary are visualized in black/gray for baseline BNNs, and — the level of prior knowledge (strongly vs. weakly blue for OC-BNNs. Negative constrained regions are informative) and the task (regression or classification) red; positive (Gaussian) constraints are green. For may suggest the use of different prior constraints. the classification example, the three classes are color- Defining constrained regions Formally, a positive coded red, green and blue. constrained region C + is a set of input-output tu- OC-BNNs model uncertainty in a manner that re- ples (x, y) defining where outputs given certain in- spects constrained regions while explaining train- puts should be. Conversely, a negative constrained re- ing data. Figure 1 demonstrates this for both the gion C − is a set of tuples (x, y) defining where out- regression and classification setting. Correct predic- puts given certain inputs cannot be. We will use C tions are maintained with similar uncertainty lev- when describing properties of constrained regions of els as the baseline while constraints are correctly en- both kinds and denote C x for all x in C and C y for all y forced with uncertainty levels changing to reflect that. in C . Given this formulation, it is our goal to enforce These examples demonstrate how OC-BNNs enforce Z constraints without sacrificing predictive accuracy. p(y0 ∈ / C+ 0 + + y | x ∈ C x , W ) p (W | C , D) d W ≈ 0 W Z (3) p(y0 ∈ C − 0 − − y | x ∈ C x , W ) p (W | C , D) d W ≈ 0 W Note that (3) is simply the posterior predictive distri- bution conditioned on C . The generality of this ap- proach allows for the incorporation of very compli- cated yet interpretable constraints a priori, such as for example arbitrary equality, inequality and logical (if- Figure 1. On both tasks, OC-BNNs reduce uncertainty in then and either-or) constraints. constrained regions while fitting data well. (left) 1D regres- sion. The constraint is composed of two negative regions Constraint prior We connect the weight space of (red) separated by a small gap. Uncertainty of OC-BNNs the BNN with constraints through the distribution: (blue) drops sharply in the constrained region compared g(W | C) = g(φ(C x ; W ); C y , θ ) (4) to the baseline (gray). (right) 2D classification with three classes. Constrained region enforces the prediction of green where φ(x; W ) is the BNN forward pass and θ is the class in the green rectangle (baseline depicted in inset). set of tuneable hyperparameters of g. Accordingly, a constraint prior pC (W ) can then be constructed as: OC-BNNs encourage correct out-of-distribution be- pC (W ) := p(W ) g(W | C), (5) havior. Figure 2 (left) depicts sparse data, along with out-of-distribution positive constrained regions. The achieving the goal of expressing prior function posterior predictive in-distribution closely mimics the
Output-Constrained Bayesian Neural Networks baseline, while the posterior out-of-distribution (OOD) filtered unfiltered learns to avoid the constrained region. This demon- BNN OC-BNN BNN OC-BNN strates that OC-BNNs function well away from the ACC 0.745 0.741 0.881 0.878 Train data, which is important because we typically want F1 0.805 0.801 0.882 0.880 to enforce functional constraints when there is a lack VIOL 0.151 0.149 N/A N/A of observed training data for the model to learn from. ACC 0.660 0.665 0.647 0.649 Test OC-BNNs can capture posterior multimodality. F1 0.746 0.748 0.725 0.736 While negative constraints C − do not explicitly define VIOL 0.132 0.126 0.117 0.039 multimodal posterior predictives, a bounded con- Table 1. Results for the MIMIC experiments with and with- strained region does imply that the posterior predic- out filtering out the points in the constrained region. Ac- tive might have probability mass on either side of the curacy and F1 score remain unchanged when using OC- bounded region (i.e. for all d dimensions of Y = Rd ). BNNs. For the experiment with filtration, the violation fac- Figure 2 (right), demonstrates that we capture chal- tor decreases by a factor of 3 when using OC-BNNs. lenging posterior predictives. within the constrained region. We train our model both with and without artificially filtering out all points within the positive constrained region. OC-BNNs maintain classification accuracy while re- ducing physiologically infeasible constraint viola- tions. Table 1 displays experimental results, with statistics computed from the posterior mean. In ad- dition to standard accuracy (ACC) and F1 score, we Figure 2. OC-BNNs capture important posterior qualities measure the violation fraction (VIOL), which is the such as correct OOD behavior and multimodality. (left) OC-BNNs (blue) maintain the same in-distribution uncer- fraction of predictions on held-out points that vio- tainty as the baseline (gray) while adhering to OOD positive late the constraints. The results show that OC-BNNs constraints (green) on either side of the plot. (right) OC- match standard BNNs on all predictive accuracy met- BNNs (blue) posterior samples go both below and above rics, with significantly lower violation of the con- the negative constraint box (red). strained region for the case where points originally in the constrained region are filtered out. 6. Applications 6.2. Human motion prediction 6.1. Clinical action prediction We evaluate OC-BNNs on data of humans conduct- MIMIC-III (Johnson et al., 2016) is a benchmark ing various motions available at (Kratzer, 2019) as de- database containing time series data of various physi- scribed in (Kratzer et al., 2018). This data contains hu- ological measurements and clinical actions prescribed man upper body poses across many reaching tasks at belonging to > 40, 000 intensive care patients who a frame rate of 120Hz. The poses are provided in the stayed at the Beth Israel Deaconess Medical Center form of upper body joint angles. between 2001 and 2012. Problem formulation Given a subset of trajecto- Problem Formulation From the raw time-series ries in (Kratzer, 2019), our goal is to predict joint an- data, we construct a balanced dataset for a time- gles 20 frames in the future from angles at the current independent classification task of hypotension man- time frame and the numerically computed joint veloc- agement. There are 9 features representing various ities and accelerations. In the following, we limit our- physiological states, such as mean blood pressure and selves to abduction and flexion (further denoted as Y- lactate levels. The goal is to predict if clinical action and Z-rotation to match the nomenclature in the orig- (either vasopressor or IV fluid) should be taken. inal data (Kratzer, 2019)) of the left and right shoulder during right-handed reaching motions. Constraints The constraint imposed is that for mean blood pressure less than 65 units, some ac- The joint angles in the test data were perturbed with tion should be taken, which is physiologically realis- normally distributed noise (µ=0, σ =2 degrees) to tic. We apply the positive (Dirichlet) constraint prior simulate a scenario in which a human motion predic- (Appendix A), as well as the weights-only prior base- tion model is trained on data recorded in a high-end line. In the given data, some training points fall motion capture lab, and then used to predict motion from data obtained by noisy wearable sensors.
Output-Constrained Bayesian Neural Networks Constraints Several anatomical feasibility or func- 7. Discussion tional range constraints for each of the joint angles could be applied, e.g. as described in (Namdari et al., OC-BNNs prevent constraint violation while fitting 2012). We derived constraints on the joint limits from low- and high-dimensional data. Our results high- the reaching motions provided in (Kratzer, 2019) as light that incorporating expert knowledge into OC- the empirically observed extrema across all motions, BNNs helps enforcing feasible and thus more ro- which is modeled using the negative constraint prior. bust predictions. Results for both datasets in Sec- tion 6 demonstrate that constraint violation metrics OC-BNNs prevent infeasible predictions. We com- are reduced significantly, whereas accuracy metrics pare a BNN and OC-BNN using the negative prior are nearly unchanged. This affirms the behavior ob- and the empirical bounds on joint angles. Both mod- served in the synthetic examples in Section 5. els are compared in (i) RMSE using the posterior pre- dictive mean (RMSE) [1·103 ], (ii) held-out data log Training data in constrained region can outweigh 2 ) with posterior predictive likelihood of N (µ pp , σpp prior effect. The clinical dataset results show that the mean µ pp and variance σpp (HO-LL), and (iii) poste- presence of data in C reduces the effect of constraint rior predictive violation defined as the percentage of priors. This is expected and in accordance with the probability mass in an infeasible constrained region Bayesian framework, where the likelihood effect will (PP-VIOL) [%], each evaluated at all target points. crowd out the prior given enough training data, and also suggests that the practitioner can use OC-BNNs These metrics are summarized in Table 2. We find that even for situations where the constraints themselves OC-BNNs reduce the possibility of making an infea- may not be fully satisfied. sible prediction to less than 0.001%, substantially im- proving on BNNs. Figure 3 shows exemplary motion OC-BNNs can facilitate data imputation. The fact predictions obtained with both BNN and OC-BNN that OC-BNNs model uncertainty correctly in con- for five consecutive points in a test trajectory. strained regions without losing predictive accuracy, even for high-dimensional datasets, show that OC- BNN OC-BNN BNNs can encode imputation in input regions with- RMSE 0.929 1.252 out training data. Rather than directly modifying the Train HO-LL 1718.409 1342.602 training set through imputation, prior beliefs about PP-VIOL 0.046 0.000 missing data can instead be formulated as constraints. RMSE 7.320 12.127 When to use which prior? In the regression setting, Test HO-LL 101.129 -683.697 negative priors are weakly informative whereas pos- PP-VIOL 18.447 0.000 itive priors tend to be strongly informative – one or Table 2. Results for human motion prediction. While pre- both of the prior types can be used depending on dictive performance and held-out log likelihood are similar, domain knowledge. While the negative prior for- OC-BNNs (negative prior) reduce the chance of predicting mulation does not apply to classification cases, this an infeasible position to 0.0 % while BNNs make infeasible does not pose a problem as negative and positive con- predictions in 18.4% of cases. straints are complements in discrete space. 8. Conclusion and Outlook We describe OC-BNNs, a formulation to incorporate expert knowledge into BNNs by prescribing positive and negative (i.e., desired and forbidden) regions, and demonstrate their application to synthetic and real-world data. We show that OC-BNNs generally maintain the desirable properties of regular BNNs while their predictions follow the prescribed con- Figure 3. Consecutive predictions of right Z-rotation during straints. This makes them a promising tool for set- an exemplary test trajectory with 10 frame gaps. The pos- tings like healthcare, where models trained on sparse terior predictive of BNN (black) and OC-BNN (blue) given data may be augmented with expert knowledge. In the input state 20 frames earlier are plotted along the y-axis. While both BNN and OC-BNN do not perfectly generalize addition, OC-BNNs may find applications in safe re- to the test set, the expert knowledge of valid joint positions inforcement learning, e.g. in tasks where certain ac- enforces feasible and thus more robust predictions. tions are known to have catastrophic consequences.
Output-Constrained Bayesian Neural Networks Acknowledgements A. Constraint Priors MG and FDV acknowledge support from AFOSR FA In this section, we describe the detailed functional 9550-17-1-0155. LL and WY acknowledge support forms of our positive and negative constraints and from the John A. Paulson School of Engineering and priors for both classification and regression settings, Applied Sciences at Harvard University. noting aspects important for inference. References A.1. Positive constraint prior Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Since C + describes the set of points that the learned Davidson, J. Reliable uncertainty estimates in deep function should model, g(φ(C x ; W ); C y , θ ) has the neural networks using noise contrastive priors. In straightforward interpretation of measuring how eprint arXiv:1807.09289, 2018. closely φ(C x ; W ) lies to C y . Most common prob- ability distributions as well as (possibly improper) Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., user-defined distributions are amenable, though dif- Feng, M., Ghassemi, M., Moody, B., Szolovits, P., ferentiability may be a condition for certain inference Celi, L. A., and Mark, R. G. Mimic-iii, a freely acces- methods. In particular, natural choices of distribu- sible critical care database. Scientific data, 3:160035, tions exist for both regression and classification. 2016. Regression In the simplest setting, for which there Kratzer, P. mocap-mlr-datasets. https://github. is a known ground-truth function described perfectly com/charlespwd/project-title, 2019. by C + , the Gaussian distribution is a natural choice: Kratzer, P., Toussaint, M., and Mainprice, J. Towards g(W | C + ) = ∏ 2 N (φ(x; W ); y, σ+ ) (6) combining motion optimization and data driven x,y∼π (C + ) dynamical models for human motion prediction. In 2018 IEEE-RAS 18th International Conference on Hu- where π (C + ) is a sampling distribution for C + , which manoid Robots (Humanoids), pp. 202–208. IEEE, 2018. is necessary for tractability if C + is large or infinite. π (C + ) itself can be user-defined as the domain al- Liu, Q. and Wang, D. Stein variational gradient de- lows, allowing for flexibility in sampling. σ+ is the scent: A general purpose bayesian inference algo- tuneable standard deviation of the Gaussian, control- rithm. In Advances In Neural Information Processing ling strictness of deviation from C + . More generally, it Systems, pp. 2378–2386, 2016. is possible that there exists multiple y ∈ C +y for some Lorenzi, M. and Filippone, M. Constraining the dy- x ∈ C+ x . This can be expressed using multimodal dis- namics of deep probabilistic models. arXiv preprint tributions, for example: arXiv:1802.05680, 2018. K MacKay, D. J. C. Probable networks and plausible g(W | C + ) = ∏ ∑ ωk N (φ(x; W ); yk , σ+2 ) (7) x,{y}kK ∼π (C + ) k =1 predictions – a review of practical bayesian meth- ods for supervised neural networks. In Network: where ωk are the user-defined mixture weights. Computation in Neural Systems, 6:3, 469-505, 1995. C+ Classification y describes the classes that the Namdari, S., Yagnik, G., Ebaugh, D. D., Nagda, S., BNN is constrained to for the corresponding C + x . In Ramsey, M. L., Williams Jr, G. R., and Mehta, S. the discrete setting, the natural distribution is the Defining functional shoulder range of motion for Dirichlet. For K classes, activities of daily living. Journal of shoulder and el- bow surgery, 21(9):1177–1183, 2012. g(W | C + ) = ∏ D ir(φ(x; W ); α) (8) x,y∼π (C + ) Neal, R. M. Bayesian Learning for Neural Networks. PhD ( 1 if yk = 1 thesis, Graduate Department of Computer Science, where αk = for some control- 1 − ασ otherwise University of Toronto, 1995. lable penalty ασ . Neal, R. M. Mcmc using hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, 2012. A.2. Negative constraint prior Sun, S., Zhang, G., Shi, J., and Grosse, R. Func- The negative constraint prior enforces the infeasibil- tional variational bayesian neural networks. arXiv ity of regions in function space and is constructed by preprint arXiv:1903.05779, 2019. placing little prior probability on high expected vio-
Output-Constrained Bayesian Neural Networks lation of C − : As the presence of g(W | C) increases the magnitude of the prior pC (W ), empirical performance typically ! h i g(W |C − ) = exp E − γ c x, φ(x, W ); C − (9) improves by using a smaller step-size than with p(W ) x∼π (Cx− ) for the same dataset. In (9), c(x, y; C − ) is a classifier function that encodes Stein Variational Gradient Descent (SVGD) SVGD softly whether or not (x, y) is in C − , which allows (Liu & Wang, 2016) is a VI method where a set of black-box use with any inference technique: S particles (in our case, {W s }Ss=1 ) are optimized via functional gradient descent to mimic the true poste- J Kj rior. SVGD combines the efficiency of VI methods c(x, y; C − ) = ∑ ∏ στ0 ,τ1 f j x, y k (10) with the ability of MCMC methods to capture more j =1 k =1 expressive posterior approximations. p(W ) is substi- The definition of c(x, y; C − ) assumes that the nega- tuted by pC (W ) in the computation of the functional tive region C − is defined by J sets of K j inequality gradient: SJ constraints f j (x, y) ≤ 0, i.e. C − = C−j with j =1 1 S h − C j = {(x, y) | f j (x, y) ≤ 0}, which can define arbi- φ̂∗ (W ) = ∑ k(W s , W )∇W s [log pC (W s ) (13) S s =1 trary linear and nonlinear shapes in the input-output i space. στ0 ,τ1 (z) is a soft indicator of whether a con- + log p(D | W s )] + ∇W s k(W s , W ) straint of the form z ≤ 0 is satisfied, a more generally- Our implementation of SVGD uses the weighted RBF parameterizable sigmoidal activation defined as kernel k ( x, x 0 ) = exp(− 1h || x − x 0 ||22 ) and adapting στ0 ,τ1 (z) = tanh(−τ0 z) + 1 tanh(−τ1 z) + 1 (11) bandwith h as suggested in (Liu & Wang, 2016) as well as mini-batched data D for tractability. If all constraints for at least one infeasible region C − j are satisfied, our prior knowledge is violated and c(x, y; C − ) is far from 0. Otherwise, at least one con- C. Experimental Details straint of all infeasible regions is violated and our C.1. Synthetic Examples prior beliefs satisfied; c(x, y; C − ) is close to 0. Con- trary to other classification functions, the product of For all experiments, the BNN used comprises a single two tanh functions with different scales τ0 , τ1 enables hidden layer with 10 nodes, and Radial Basis Func- a sharp and steep overall classification of violating tion (RBF) activations σ ( x ) = exp{− x2 }. values in z > 0 and a smoother and flatter classifica- All regression plots show the posterior mean function tion for satisfying values in z ≤ 0, making gradients (bold line) as well as the confidence intervals for σ = less vanishing for constraint-satisfying, i.e. region- 1 (dark shading) and σ = 2 (light shading). violating inputs. We use τ0 = 15, τ1 = 2. Figure 1: (left) The constrained regions are y < 2.5 B. Inference and y > 3 for x ∈ [−0.3, 0.3]. The function gener- ating the training points is y = − x4 + 3x2 + 1. The Constraint priors can be substituted for the tradi- negative prior formulation is used. (right) The input tional prior term p(W ) with any black-box sampling space is 2-dimensional and there are 3 classes (color- or variational inference (VI) algorithm. Here, we pro- coded) with 8 training points in each class, generated vide a summary of the algorithms we use and de- from the Gaussian means (−3, 1), (0, −3) and (2, 3). scribe the trivial modifications used to incorporate The constrained region is [1, 3] × [−2, 0] and defined constraint priors pC (W ). Note that the general form such that points within the box should be classified as of pC (W ) is not normalized, which does not pose a green. The positive prior is used. HMC (10000 burn- problem for inference in practice. in, 1000 samples collected at intervals of 10) is used for both examples. Hamiltonian Monte Carlo (HMC) HMC (Neal, 2012) is a MCMC method considered to be the “gold stan- Figure 2: (left) The positive constraints are y = − x + dard” in posterior sampling even though not being 5 for x ∈ [−5.0, −3.0] and y = x + 5 for x ∈ [3.0, 5.0]. scalable. We substitute p(W ) by pC (W ) in the poten- Both constraints are Gaussian with the σ = 0.5. The tial energy term U (W ) computed at each sampling 3 training points are arbitrarily defined. HMC (10000 iteration: burn-in, 1000 samples collected at intervals of 10) is used. (Right) The constrained boxed region is x ∈ U (W ) = − log pC (W ) − log p(D | W ) (12) [−1.0, 1.0] and y ∈ [−5.0, 3.0]. The function generat-
Output-Constrained Bayesian Neural Networks ing the training points is y = − x4 + 3x2 + 1. SVGD with 75 particles is used with Adagrad. C.2. Clinical action prediction For all experiments, the BNN used comprises a 2 hid- den layers of 200 nodes each and RBF activations. SVGD is used for inference with 50 particles, 1500 it- erations, Adagrad optimization, and a suitable batch Figure 4. (left) The same training set as Figure 2 (left), but size. The size of the full dataset is 298K; this reduces with negative constraints defined out-of-distribution. OC- to 125K when points in the constrained region are fil- BNNs fit the sparse data while avoiding the constraints. tered out. Details on the prior formulation for can be (right) Positive prior with mixture of two Gaussians. Us- found in A. The Dirichlet parameter is set to 10 for ing SVGD, individual OC-BNN samples (blue) capture both allowed classes and 0.01 for forbidden classes. modes. C.3. Human motion prediction For these experiments, the BNN used comprises a 2 hidden layers of 100 nodes each and RBF activations. For inference, we again used SVGD and Adagrad with 50 particles and 1000 iterations. The negative prior used 50 samples from π (C − x ) and γ = 10, 000, see Eq. 9. We randomly chose a subset of 10 right-handed reaching trajectories from (Kratzer, 2019). This data was randomly split into 5 training and 5 test tra- jectories, which amounts to 243 train Markov states of sensors for training and 142 states for evalua- tion. Given this problem setting, the regression task had 12-dimensional inputs and 4-dimensional tar- gets. The number of training trajectories was kept low to increase sparsity and the difficulty of successful ro- bust generalization. D. Additional Results D.1. Additional Synthetic Examples Figure 4 shows additional examples for out-of- distribution and multimodal behavior. (left) Out- of-distribution negative constraints. The negative constraints are y > − x + 7 and y < − x + 2 for x ∈ [−5.0, −3.0] and y > x + 7 and y < x + 2 for x ∈ [3.0, 5.0]. The training points are identical to those in the left plot of Figure 2. HMC (10000 burn- in, 1000 samples collected at intervals of 10) is used. (right) Multimodal positive constraints. The two pos- itive functions are y = −0.2x3 + 0.5x2 + 0.7x − 0.5 and y = 0.2x3 − 0.15x2 + 3.5, both for the domain x ∈ [−1.0, 1.0]. The training points were arbitrarily defined. An equally-weighted mixture of two Gaus- sians with σ = 0.5 is used as the positive constraint prior. SVGD with 75 particles and Adagrad are used.
You can also read