Hinge-loss Markov Random Fields: Convex Inference for Structured Prediction
Stephen H. Bach, Bert Huang, Ben London, Lise Getoor
Computer Science Dept., University of Maryland, College Park, MD 20742, USA

Abstract

Graphical models for structured domains are powerful tools, but the computational complexities of combinatorial prediction spaces can force restrictions on models, or require approximate inference in order to be tractable. Instead of working in a combinatorial space, we use hinge-loss Markov random fields (HL-MRFs), an expressive class of graphical models with log-concave density functions over continuous variables, which can represent confidences in discrete predictions. This paper demonstrates that HL-MRFs are general tools for fast and accurate structured prediction. We introduce the first inference algorithm that is both scalable and applicable to the full class of HL-MRFs, and show how to train HL-MRFs with several learning algorithms. Our experiments show that HL-MRFs match or surpass the predictive performance of state-of-the-art methods, including discrete models, in four application domains.

1 INTRODUCTION

The study of probabilistic modeling in structured and relational domains primarily focuses on predicting discrete variables [12, 19, 24]. However, except for some isolated cases, the combinatorial nature of discrete, structured prediction spaces requires concessions: most notably, for inference algorithms to be tractable, they must be relaxed or approximate (e.g., [19, 22, 25]), or the model's structure must be restricted (e.g., [2, 8]). Broecheler et al. [6] introduced a class of models for continuous variables that has the potential to combine fast and exact inference with the expressivity of discrete graphical models, but this potential has not been well explored. Now called hinge-loss Markov random fields (HL-MRFs) [3], these models are analogous to discrete Markov random fields, except that random variables are continuous valued in the unit interval [0, 1], and potentials are linear or squared hinge-loss functions.

In this work, we demonstrate that HL-MRFs are powerful tools for structured prediction by producing state-of-the-art performance in a number of domains. We are the first to leverage some of the most powerful features of HL-MRFs, such as squared potentials, which we show produce better results on multiple tasks. We show that HL-MRFs are well-suited to structured prediction for the following reasons. They are expressive, interpretable, and easily defined using the modeling language probabilistic soft logic (PSL) [6, 13]. Further, continuous variables are useful both for modeling continuous data as well as for expressing confidences in discrete predictions. Confidences are desirable for the same reason that practitioners often prefer marginal probabilities to the single most probable discrete prediction. Finally, HL-MRFs have log-concave density functions, so finding an exact most probable explanation (MPE) for a given input is a convex optimization, and therefore exactly solvable in polynomial time.

Our specific contributions include the following. First, we introduce a new, fast algorithm for MPE inference in HL-MRFs, which is the first to be both scalable and applicable to the full class of HL-MRFs. Second, we show how to train HL-MRFs with several learning algorithms. Third, we show empirically that these advances enable HL-MRFs to tackle a diverse set of relational and structured prediction tasks, providing state-of-the-art performance on collective classification, social-trust prediction, collaborative filtering, and image reconstruction. In particular, we show that HL-MRFs can outperform their discrete counterparts, as well as other leading methods.
1.1 RELATED WORK

Probabilistic soft logic (PSL) [6, 13] is a declarative language for defining templated HL-MRFs. Its development was partially motivated by the need for rich models of continuous similarity values, for use in tasks such as entity resolution, collective classification, and ontology alignment. PSL is in a family of systems for defining templated, relational probabilistic models that includes, for instance, Markov logic networks [19], relational dependency networks [17], and relational Markov networks [23]. In our experiments, we compare against Markov logic networks, which use a first-order syntax similar to PSL's to build discrete probabilistic models.

MPE inference algorithms for HL-MRFs solve a constrained, convex optimization. A standard approach for general, constrained, convex optimization is to use an interior-point method, which Broecheler et al. [6] use. While theoretically efficient, the practical running time of interior-point optimizers quickly becomes cumbersome for large problems. For discrete graphical models, recent advances use consensus optimization to obtain fast, approximate MPE inference algorithms [5, 15, 16]. Bach et al. [3] recently developed an analogous algorithm for exact MPE inference in HL-MRFs that produced a significant improvement in running time over interior-point methods, though it was limited to pairwise potentials and constraints, and cost considerably more computation to optimize over squared potentials.

The learning methods we adapt for HL-MRFs are standard approaches for learning parameters of probabilistic models. In particular, our adaptations are analogous to previous learning algorithms for relational and structured models using approximate maximum-likelihood or maximum-pseudolikelihood estimation [6, 14, 17, 19] and large-margin estimation [11, 12, 24].

2 HINGE-LOSS MARKOV RANDOM FIELDS

Hinge-loss Markov random fields (HL-MRFs) are a general class of conditional, continuous probabilistic models [3]. HL-MRFs are log-linear models whose features are hinge-loss functions of the variable states. Through constructions based on soft logic, hinge-loss potentials can be used to model generalizations of logical conjunction and implication. HL-MRFs can be defined using the modeling language probabilistic soft logic (PSL) [6, 13], making these powerful models interpretable, flexible, and expressive. In this section, we formally present constrained hinge-loss energy functions, HL-MRFs, and briefly review PSL.

Definition 1. Let Y = (Y_1, ..., Y_n) be a vector of n variables and X = (X_1, ..., X_{n'}) a vector of n' variables with joint domain D = [0, 1]^{n+n'}. Let φ = (φ_1, ..., φ_m) be m continuous potentials of the form

$$\phi_j(Y, X) = \left[ \max\{\ell_j(Y, X), 0\} \right]^{p_j},$$

where ℓ_j is a linear function of Y and X and p_j ∈ {1, 2}. Let C = (C_1, ..., C_r) be linear constraint functions associated with index sets denoting equality constraints E and inequality constraints I, which define the feasible set

$$\tilde{D} = \left\{ (Y, X) \in D \;\middle|\; C_k(Y, X) = 0 \;\; \forall k \in E, \quad C_k(Y, X) \geq 0 \;\; \forall k \in I \right\}.$$

For (Y, X) ∈ D̃, given a vector of nonnegative free parameters, i.e., weights, λ = (λ_1, ..., λ_m), a constrained hinge-loss energy function f_λ is defined as

$$f_\lambda(Y, X) = \sum_{j=1}^{m} \lambda_j \, \phi_j(Y, X).$$

Definition 2. A hinge-loss Markov random field P over random variables Y and conditioned on random variables X is a probability density defined as follows: if (Y, X) ∉ D̃, then P(Y|X) = 0; if (Y, X) ∈ D̃, then

$$P(Y \mid X) = \frac{1}{Z(\lambda)} \exp\left[ -f_\lambda(Y, X) \right], \qquad \text{where } Z(\lambda) = \int_{Y} \exp\left[ -f_\lambda(Y, X) \right].$$

Thus, MPE inference is equivalent to finding the minimizer of the convex energy f_λ.

The potentials and weights can be grouped together into templates, which can be used to define general classes of HL-MRFs that are parameterized by the input data. Let T = (t_1, ..., t_s) denote a vector of templates with associated weights Λ = (Λ_1, ..., Λ_s). We partition the potentials by their associated templates and let

$$\Phi_q(Y, X) = \sum_{j \in t_q} \phi_j(Y, X)$$

for all t_q ∈ T. In the HL-MRF, the weight of the j-th hinge-loss potential is set to the weight of the template from which it was derived, i.e., λ_j = Λ_q for each j ∈ t_q.

Probabilistic soft logic [6, 13] provides a natural interface to represent hinge-loss potential templates using logical rules. In particular, a logical conjunction of Boolean variables X ∧ Y can be generalized to continuous variables using the hinge function max{X + Y − 1, 0}, which is known as the Lukasiewicz t-norm. Disjunction X ∨ Y is relaxed to min{X + Y, 1}, and negation ¬X to 1 − X. PSL allows modelers to design rules that, given data, ground out substitutions for logical terms. The groundings of a template define hinge-loss potentials that share the same weight and have the form one minus the truth value of the ground rule. We defer to Broecheler et al. [6] and Kimmig et al. [13] for details on PSL.
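To make these relaxations concrete, the following minimal sketch (our own illustration, not code from the paper; all function names are ours) evaluates the Lukasiewicz operators and the hinge-loss potential that a ground rule BODY ⇒ HEAD contributes to the energy:

    import numpy as np

    # Lukasiewicz relaxations of the Boolean connectives on [0, 1].
    def soft_and(x, y):   # X AND Y relaxes to max{x + y - 1, 0}
        return max(x + y - 1.0, 0.0)

    def soft_or(x, y):    # X OR Y relaxes to min{x + y, 1}
        return min(x + y, 1.0)

    def soft_not(x):      # NOT X relaxes to 1 - x
        return 1.0 - x

    def rule_potential(body, head, p=1):
        # One minus the truth value of BODY => HEAD, i.e. the
        # hinge [max{body - head, 0}]^p with p in {1, 2}.
        return max(body - head, 0.0) ** p

    def energy(potentials, weights):
        # Constrained hinge-loss energy f_lambda = sum_j lambda_j * phi_j.
        return float(np.dot(weights, potentials))

    # A AND B => C for a continuous assignment:
    a, b, c = 0.9, 0.8, 0.4
    print(rule_potential(soft_and(a, b), c, p=2))  # (0.7 - 0.4)^2 = 0.09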
To further demonstrate this templating, consider the task of predicting who trusts whom in a social network. Let the network contain three individuals: A, B, and C. We can design an HL-MRF to include potentials that encode the belief that trust is transitive (which is a rule we use in our experiments). Let the variable Y_{A,B} represent how much A trusts B, and similarly so for Y_{B,C} and Y_{A,C}. Then the potential

$$\phi(Y, X) = \left[ \max\{Y_{A,B} + Y_{B,C} - Y_{A,C} - 1, 0\} \right]^{p}$$

is equivalent to one minus the truth value of the Boolean formula Y_{A,B} ∧ Y_{B,C} → Y_{A,C} when Y_{A,B}, Y_{B,C}, Y_{A,C} ∈ {0, 1}. When they are allowed to take on their full range [0, 1], the potential is a convex relaxation of the implication. An HL-MRF with this potential function assigns higher probability to variable states that satisfy the logical implication above, which can occur to varying degrees in the continuous domain. Given a social network with more than these three individuals, PSL can ground out possible substitutions for the roles of A, B, and C to generate potential functions for each substitution, thus defining the full, ground HL-MRF.

HL-MRFs support a few additional components useful for modeling. The constraints in Definition 1 allow the encoding of functional modeling requirements, which is useful, e.g., when variables correspond to mutually exclusive labels, and thus should sum to one. The exponent parameter p_j allows flexibility in the shape of the hinge, affecting the sharpness of the penalty for violating the logical implication. Setting p_j to 1 penalizes violation linearly with the amount the implication is unsatisfied, while setting p_j to 2 penalizes small violations much less. In effect, some linear potentials overrule others, while the influences of squared potentials are averaged together.

3 MPE INFERENCE

MPE inference for HL-MRFs requires finding a feasible assignment that minimizes f_λ. Performing MPE inference quickly is crucial, especially because weight learning often requires performing inference many times with different weights (as we discuss in Section 4). Here, HL-MRFs have a distinct advantage over general discrete models, since minimizing f_λ is a convex optimization rather than a combinatorial one. In this section, we detail a new, faster MPE inference algorithm for HL-MRFs.

Bach et al. [3] showed how to minimize f_λ with a consensus-optimization algorithm, based on the alternating-direction method of multipliers (ADMM) [5]. The algorithm works by creating local copies of the variables in each potential and constraint, constraining them to be equal to the original variables, and relaxing those equality constraints to make independent subproblems. By solving the subproblems repeatedly and averaging the results, the algorithm reaches a consensus on the best values of the original variables, also called the consensus variables. This procedure is guaranteed to converge to the global minimizer of f_λ. See [3] and [5] for more details on consensus optimization and ADMM.

This previous consensus-optimization approach to MPE inference works well for linear potentials with at most two unobserved variables, and empirical evidence suggests it scales linearly with the size of the problem [3]. However, it is restricted to pairwise potentials and constraints, and requires an interior-point method as a subroutine to solve subproblems induced by squared potentials. Because of the embedded interior-point method, its running time can increase roughly 100-fold with squared potentials [3].

We improve the algorithm of Bach et al. [3] by reformulating the optimization to enforce Y ∈ [0, 1]^n only on the consensus variables, not the local copies. This form of consensus optimization is described in greater detail by Boyd et al. [5]. The result is that, in our algorithm, the potentials and constraints are not restricted to a certain number of unknowns, and the subproblems can all be solved quickly using simple linear algebra.

Algorithm 1 gives pseudocode for our new algorithm. It starts by initializing local copies of the variables that appear in each potential and constraint, along with a corresponding Lagrange multiplier for each copy. Then, until convergence, it iteratively updates Lagrange multipliers and solves subproblems induced by the HL-MRF's potentials and constraints. If the subproblem is induced by a potential, it sets the local variable copies to a balance between the minimizer of the potential and the emerging consensus. Eventually the Lagrange multipliers will enforce agreement between the local copies and the consensus. If instead the subproblem is induced by a constraint in the HL-MRF, the algorithm projects the consensus variables' current values to that constraint's feasible region. The consensus variables are updated at each iteration based on the values of their local copies and corresponding Lagrange multipliers, and clipped to [0, 1].

We take the same basic approach as Bach et al. [3] to solve the subproblems. We first try to find a minimizer on either side of the hinge (where ℓ_j(Y, X) is either positive or negative) before projecting onto the plane defined by ℓ_j(Y, X) = 0. Without the constraints on the local variable copies, finding these minimizers and projections is much simpler. Solving for the constraint subproblems is now also simpler, requiring just a projection onto a hyperplane or a halfspace.
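To illustrate this case analysis, the following is a rough sketch under our own formulation (not the authors' implementation): it minimizes λ[max{ℓ(y), 0}]^p + (ρ/2)‖y − z‖² for a linear ℓ(y) = cᵀy + d, where z stands for the consensus point (Y_j − α_j/ρ in Algorithm 1), by trying each side of the hinge and falling back to a hyperplane projection:

    import numpy as np

    def solve_potential_subproblem(c, d, lam, p, rho, z):
        # Case 1: try the flat side, where the potential is 0 and the
        # proximal term alone is minimized at z.
        y = z.copy()
        if c @ y + d <= 0.0:
            return y
        # Case 2: minimizer on the active side, where the objective is
        # lam*(c@y + d)**p + (rho/2)*||y - z||^2; closed forms for p = 1, 2.
        if p == 1:
            y = z - (lam / rho) * c
        else:
            # Setting the gradient to zero gives y = z - (2*lam*s/rho)*c
            # with s = c@y + d; substitute to solve for s.
            s = (c @ z + d) / (1.0 + 2.0 * lam * (c @ c) / rho)
            y = z - (2.0 * lam * s / rho) * c
        if c @ y + d >= 0.0:
            return y
        # Case 3: the two sides disagree; project z onto ell(y) = 0.
        return z - ((c @ z + d) / (c @ c)) * c

    # Trust-transitivity potential: ell(y) = y1 + y2 - y3 - 1, squared hinge.
    y = solve_potential_subproblem(np.array([1.0, 1.0, -1.0]), -1.0,
                                   lam=2.0, p=2, rho=1.0,
                                   z=np.array([0.9, 0.8, 0.4]))
    print(y)

The closed forms in cases 1 and 2 and the projection in case 3 are all simple linear-algebra expressions, which is what makes the unconstrained subproblems fast.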
Our algorithm retains all of the benefits of the original MPE inference algorithm while removing all restrictions on the numbers of unknowns in the potentials and constraints, making MPE inference fast with both linear and squared potentials. Another advantage of consensus optimization for MPE inference is that it is easy to warm start. Warm starting provides significant efficiency gains when inference is repeated on the same HL-MRF with small changes in the weights, as often occurs during weight learning.

Algorithm 1 MPE Inference for HL-MRFs
Input: HL-MRF(Y, X, φ, λ, C, E, I), ρ > 0

    Initialize y_j as copies of the variables Y_j that appear in φ_j, j = 1, ..., m
    Initialize y_{k+m} as copies of the variables Y_{k+m} that appear in C_k, k = 1, ..., r
    Initialize Lagrange multipliers α_i corresponding to variable copies y_i, i = 1, ..., m + r
    while not converged do
        for j = 1, ..., m do
            α_j ← α_j + ρ(y_j − Y_j)
            y_j ← Y_j − α_j/ρ
            if ℓ_j(y_j, X) > 0 then
                y_j ← argmin_{y_j} λ_j [ℓ_j(y_j, X)]^{p_j} + (ρ/2) ‖y_j − Y_j + (1/ρ)α_j‖²₂
                if ℓ_j(y_j, X) < 0 then
                    y_j ← Proj_{ℓ_j = 0}(Y_j)
                end if
            end if
        end for
        for k = 1, ..., r do
            α_{k+m} ← α_{k+m} + ρ(y_{k+m} − Y_{k+m})
            y_{k+m} ← Proj_{C_k}(Y_{k+m})
        end for
        for i = 1, ..., n do
            Y_i ← (1 / |copies(Y_i)|) Σ_{y_c ∈ copies(Y_i)} (y_c + α_c/ρ)
            Clip Y_i to [0, 1]
        end for
    end while

4 WEIGHT LEARNING

In this section, we present three weight learning methods for HL-MRFs, each with a different objective function, two of which are new for learning HL-MRFs. The first method, introduced by Broecheler et al. [6], performs approximate maximum-likelihood estimation using MPE inference to approximate the gradient of the log-likelihood. The second method maximizes the pseudolikelihood. The third method finds a large-margin solution, preferring weights that discriminate the ground truth from other states. We describe below how to apply these learning strategies to HL-MRFs.

4.1 MAXIMUM-LIKELIHOOD ESTIMATION

The canonical approach for learning parameters Λ is to maximize the log-likelihood of training data. The partial derivative of the log-likelihood with respect to a parameter Λ_q is

$$\frac{\partial \log p(Y \mid X)}{\partial \Lambda_q} = \mathbb{E}_{\Lambda}\left[ \Phi_q(Y, X) \right] - \Phi_q(Y, X),$$

where E_Λ is the expectation under the distribution defined by Λ. The voted perceptron algorithm [7] optimizes Λ by taking steps of fixed length in the direction of the gradient, then averaging the points after all steps. Any step that is outside the feasible region is projected back before continuing. For a smoother ascent, it is often helpful to divide the q-th component of the gradient by the number of groundings |t_q| of the q-th template [14], which we do in our experiments. Computing the expectation is intractable, so we use a common approximation: the values of the potential functions at the most probable setting of Y with the current parameters [6].
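A minimal sketch of this procedure under stated assumptions follows; it is our own rendering, not the authors' implementation, and mpe_inference, Phi, and num_groundings are placeholders for the inference algorithm, the per-template potential sums, and the per-template grounding counts:

    import numpy as np

    def approx_mle(Lambda, Phi, mpe_inference, Y_true, X,
                   num_groundings, steps=100, step_size=1.0):
        # Voted perceptron with the MPE approximation of E_Lambda[Phi_q]:
        # the expectation is replaced by Phi at the most probable Y.
        iterates = []
        for _ in range(steps):
            Y_mpe = mpe_inference(Lambda, X)
            # Gradient of the log-likelihood, scaled per template.
            grad = (Phi(Y_mpe, X) - Phi(Y_true, X)) / num_groundings
            # Fixed-length ascent step, projected back to Lambda >= 0.
            Lambda = np.maximum(Lambda + step_size * grad, 0.0)
            iterates.append(Lambda)
        return np.mean(iterates, axis=0)  # average the points after all steps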
4.2 MAXIMUM-PSEUDOLIKELIHOOD ESTIMATION

Since exact maximum likelihood estimation is intractable in general, we can instead perform maximum-pseudolikelihood estimation (MPLE) [4], which maximizes the likelihood of each variable conditioned on all other variables, i.e.,

$$P^{*}(Y \mid X) = \prod_{i=1}^{n} P^{*}\left(Y_i \mid \mathrm{MB}(Y_i)\right) = \prod_{i=1}^{n} \frac{1}{Z(\lambda, Y_i)} \exp\left[ -f_\lambda^{i}(Y_i, Y, X) \right];$$

$$Z(\lambda, Y_i) = \int_{Y_i} \exp\left[ -f_\lambda^{i}(Y_i, Y, X) \right];$$

$$f_\lambda^{i}(Y_i, Y, X) = \sum_{j : i \in \phi_j} \lambda_j \, \phi_j\left( \{Y_i\} \cup Y_{\setminus i}, X \right).$$

Here, i ∈ φ_j means that Y_i is involved in φ_j, and MB(Y_i) denotes the Markov blanket of Y_i, that is, the set of variables that co-occur with Y_i in any potential function. The partial derivative of the log-pseudolikelihood with respect to Λ_q is

$$\frac{\partial \log P^{*}(Y \mid X)}{\partial \Lambda_q} = \sum_{i=1}^{n} \left( \mathbb{E}_{Y_i \mid \mathrm{MB}(Y_i)} \left[ \sum_{j \in t_q : i \in \phi_j} \phi_j(Y, X) \right] - \sum_{j \in t_q : i \in \phi_j} \phi_j(Y, X) \right).$$

Computing the pseudolikelihood gradient does not require inference and takes time linear in the size of Y. However, the integral in the above expectation does not readily admit a closed-form antiderivative, so we approximate the expectation. When a variable is unconstrained, the domain of integration is a one-dimensional interval on the real number line, so Monte Carlo integration quickly converges to an accurate estimate of the expectation.

We can also apply MPLE when the constraints are not too interdependent. For example, for linear equality constraints over disjoint groups of variables (e.g., variable sets that must sum to 1.0), we can block-sample the constrained variables by sampling uniformly from a simplex. These types of constraints are often used to represent mutual exclusivity of classification labels. We can compute accurate estimates quickly because these blocks are typically low-dimensional.
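As a sketch of the one-dimensional case (our own illustration; a self-normalized estimate with a uniform proposal over [0, 1] is one plausible reading of the text, and the function names are ours):

    import numpy as np

    def conditional_expectation(phi_sum, f_i, num_samples=1000, rng=None):
        # Monte Carlo estimate of E_{Y_i | MB(Y_i)}[phi_sum(y)] for an
        # unconstrained variable with domain [0, 1].
        # f_i(y): conditional energy of Y_i with all other variables fixed;
        # phi_sum(y): sum of the template's potentials touching Y_i.
        rng = rng or np.random.default_rng(0)
        ys = rng.uniform(0.0, 1.0, num_samples)
        # Unnormalized conditional density p(y) proportional to exp(-f_i(y)).
        w = np.exp(-np.array([f_i(y) for y in ys]))
        vals = np.array([phi_sum(y) for y in ys])
        return float((w * vals).sum() / w.sum())  # self-normalized average

    # Example: one squared hinge potential phi(y) = max(0.7 - y, 0)^2
    # with weight 2, so the conditional energy is 2 * phi(y).
    est = conditional_expectation(lambda y: max(0.7 - y, 0.0) ** 2,
                                  lambda y: 2.0 * max(0.7 - y, 0.0) ** 2)
    print(est)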
4.3 LARGE-MARGIN ESTIMATION

A different approach to learning drops the probabilistic interpretation of the model and views HL-MRF inference as a prediction function. Large-margin estimation (LME) shifts the goal of learning from producing accurate probabilistic models to instead producing accurate MPE predictions. The learning task is then to find the weights Λ that provide high-accuracy structured predictions. We describe in this section a large-margin method based on the cutting-plane approach for structural support vector machines (SVMs) [12].

The intuition behind large-margin structured prediction is that the ground-truth state should have energy lower than any alternate state by a large margin. In our setting, the output space is continuous, so we parameterize this margin criterion with a continuous loss function. For any valid output state Ỹ, a large-margin solution should satisfy

$$f_\lambda(Y, X) \leq f_\lambda(\tilde{Y}, X) - L(Y, \tilde{Y}), \quad \forall \tilde{Y},$$

where the decomposable loss function L(Y, Ỹ) = Σ_i L(Y_i, Ỹ_i) measures the disagreement between a state Ỹ and the training label state Y. We define L as the ℓ1 distance. Since we do not expect all problems to be perfectly separable, we relax this constraint with a penalized slack ξ. We obtain a convex learning objective for a large-margin solution

$$\min_{\Lambda \geq 0} \; \frac{1}{2} \|\Lambda\|^2 + C\xi \quad \text{s.t.} \quad \Lambda^{\top}\left( \Phi(Y, X) - \Phi(\tilde{Y}, X) \right) \leq -L(Y, \tilde{Y}) + \xi, \;\; \forall \tilde{Y},$$

where Φ(Y, X) = (Φ_1(Y, X), ..., Φ_s(Y, X)). This formulation is analogous to the margin-rescaling approach by Joachims et al. [12]. Though such a structured objective is natural and intuitive, its number of constraints is the cardinality of the output space, which here is infinite. Following their approach, we optimize subject to the infinite constraint set using a cutting-plane algorithm: we greedily grow a set K of constraints by iteratively adding the worst-violated constraint given by a separation oracle, then updating Λ subject to the current constraints. The goal of the cutting-plane approach is to efficiently find the set of active constraints at the solution for the full objective, without having to enumerate the infinite inactive constraints. The worst-violated constraint is

$$\arg\min_{\tilde{Y}} \; \Lambda^{\top} \Phi(\tilde{Y}, X) - L(Y, \tilde{Y}).$$

The separation oracle performs loss-augmented inference by adding additional loss-augmenting potentials to the HL-MRF. For ground truth in {0, 1}, these loss-augmenting potentials are also examples of hinge-losses, and thus adding them simply creates an augmented HL-MRF. The worst-violated constraint is then computed as standard inference on the loss-augmented HL-MRF. However, ground-truth variables in the interior (0, 1) cause any distance-based loss to be concave, which requires the separation oracle to solve a non-convex objective. For interior ground truth values, we use the difference of convex functions algorithm [1] to find a local optimum. Since the concave portion of the loss-augmented inference objective pivots around the ground truth value, the subgradients are 1 or −1, depending on whether the current value is greater than the ground truth. We simply choose an initial direction for interior labels by rounding, and flip the direction of the subgradients for variables whose solution states are not in the interval corresponding to the subgradient direction until convergence.

Given a set K of constraints, we solve the SVM objective in the primal form min_{Λ≥0} (1/2)‖Λ‖² + Cξ s.t. K. We then iteratively invoke the separation oracle to find the worst-violated constraint. If this new constraint is not violated, or its violation is within numerical tolerance, we have found the max-margin solution. Otherwise, we add the new constraint to K, and repeat.

One fact of note is that the large-margin criterion always requires a little slack for squared HL-MRFs. Since the squared hinge potential is quadratic and the loss is linear, there always exists a small enough distance from the ground truth such that an absolute (i.e., linear) distance is greater than the squared distance. In these cases, the slack parameter trades off between the peakedness of the learned quadratic energy function and the margin criterion.
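Stepping back, the overall cutting-plane loop can be sketched as follows (our own skeleton, not the authors' implementation; separation_oracle and solve_qp are placeholders for the loss-augmented inference above and for a quadratic-program solver over the current constraint set K):

    import numpy as np

    def large_margin_learn(Phi, loss, separation_oracle, solve_qp,
                           Y_true, X, C=0.1, tol=1e-4):
        # Grow the constraint set K with worst violators until none is
        # violated beyond numerical tolerance.
        s = len(Phi(Y_true, X))
        Lambda, xi, K = np.ones(s), 0.0, []
        while True:
            Y_hat = separation_oracle(Lambda)  # loss-augmented MPE state
            # Margin constraint: Lambda @ (Phi(Y) - Phi(Y_hat)) <= -loss + xi;
            # positive 'violation' means the constraint is broken.
            violation = (Lambda @ (Phi(Y_true, X) - Phi(Y_hat, X))
                         + loss(Y_true, Y_hat) - xi)
            if violation <= tol:
                return Lambda              # max-margin solution found
            K.append(Y_hat)
            Lambda, xi = solve_qp(K, C)    # re-solve primal s.t. K, Lambda >= 0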
5 EXPERIMENTS

To demonstrate the flexibility and effectiveness of HL-MRFs, we test them on four diverse learning tasks: collective classification, social-trust prediction, preference prediction, and image reconstruction.¹ Each of these experiments represents a problem domain that is best solved with relational learning approaches because structure is a critical component of their problems. The experiments show that HL-MRFs perform as well as or better than state-of-the-art approaches.

For these diverse tasks, we compare against a number of competing methods. For collective classification and social-trust prediction, we compare HL-MRFs to discrete Markov random fields (MRFs). We construct them with Markov logic networks (MLNs) [19], which template discrete MRFs using logical rules similarly to PSL for HL-MRFs. We perform inference in discrete MRFs using 2500 rounds of the sampling algorithm MC-Sat (500 of which are burn in), and we find approximate MPE states during MLE learning using the search algorithm MaxWalkSat [19]. For collaborative filtering, a task that is inherently continuous and non-trivial to encode in discrete logic, we compare against Bayesian probabilistic matrix factorization [20]. Finally, for image reconstruction, we run the same experimental setup as Poon and Domingos [18] and compare against the results they report, which include tests using sum product networks, deep belief networks, and deep Boltzmann machines.

When appropriate, we evaluate statistical significance using a paired t-test with rejection threshold 0.01. We omit variance statistics to save space and only report the average and statistical significance.

We describe the HL-MRFs used for our experiments using the PSL rules that define them. To investigate the differences between linear and squared potentials we use both in our experiments. HL-MRF-L refers to a model with all linear potentials and HL-MRF-Q to one with all squared potentials. When training with MLE and MPLE, we use 100 steps of voted perceptron and a step size of 1.0 (unless otherwise noted), and for LME we set C = 0.1. We experimented with various settings, but the scores of HL-MRFs and discrete MRFs were not sensitive to changes.

¹ All code is available at http://psl.umiacs.umd.edu.

5.1 COLLECTIVE CLASSIFICATION

When classifying documents, links between those documents, such as hyperlinks, citations, or co-authorship, provide extra signal beyond the local features of individual documents. Collectively predicting document classes with these links tends to improve accuracy [21]. We classify documents in citation networks using data from the Cora and Citeseer scientific paper repositories. The Cora data set contains 2,708 papers in seven categories, and 5,429 directed citation links. The Citeseer data set contains 3,312 papers in six categories, and 4,591 directed citation links.

The prediction task is, given a set of seed documents whose labels are observed, to infer the remaining document classes by propagating the seed information through the network. For each of 20 runs, we split the data sets 50/50 into training and testing partitions, and seed half of each set. To predict discrete categories with HL-MRFs we predict the category with the highest predicted value.

We compare HL-MRFs to discrete MRFs on this task. We construct both using the same logical rules, which simply encode the tendency for a class to propagate across citations. For each category C_i, we have two separate rules for each direction of citation:

Label(A, C_i) ∧ Cites(A, B) ⇒ Label(B, C_i),
Label(A, C_i) ∧ Cites(B, A) ⇒ Label(B, C_i).

Table 1: Average accuracy of classification by HL-MRFs and discrete MRFs. Scores statistically equivalent to the best scoring method are typed in bold.

                   Citeseer   Cora
HL-MRF-Q (MLE)     0.729      0.816
HL-MRF-Q (MPLE)    0.729      0.818
HL-MRF-Q (LME)     0.683      0.789
HL-MRF-L (MLE)     0.724      0.802
HL-MRF-L (MPLE)    0.729      0.808
HL-MRF-L (LME)     0.695      0.789
MRF (MLE)          0.686      0.756
MRF (MPLE)         0.715      0.797
MRF (LME)          0.687      0.783

Table 1 lists the results of this experiment. HL-MRFs are the most accurate predictors on both data sets. We also note that both variants of HL-MRFs are much faster than discrete MRFs. See Table 3 for average inference times on five folds.
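As an illustration of how such rules ground out (a hypothetical toy network of ours, not from the paper): since Cites is observed with truth value 1, the Lukasiewicz body reduces so that each ground rule contributes max{Label(A, C_i) − Label(B, C_i), 0} raised to p:

    import itertools

    categories = ["AI", "ML", "DB"]                  # hypothetical labels
    cites = [("p1", "p2"), ("p2", "p3")]             # observed directed links
    label = {("p1", "AI"): 0.9, ("p2", "AI"): 0.3, ("p3", "AI"): 0.1,
             ("p1", "ML"): 0.1, ("p2", "ML"): 0.6, ("p3", "ML"): 0.8,
             ("p1", "DB"): 0.0, ("p2", "DB"): 0.1, ("p3", "DB"): 0.1}

    def ground_potential(a, b, c, p=2):
        # Label(A,C) AND Cites(A,B) => Label(B,C), with Cites(A,B) = 1.
        return max(label[(a, c)] - label[(b, c)], 0.0) ** p

    # Both rule directions, for every observed link and category:
    phis = [ground_potential(a, b, c) for (a, b), c
            in itertools.product(cites, categories)]
    phis += [ground_potential(b, a, c) for (a, b), c
             in itertools.product(cites, categories)]
    print(len(phis))  # 12 ground potentials sharing the two template weights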
5.2 SOCIAL-TRUST PREDICTION

An emerging problem in the analysis of online social networks is the task of inferring the level of trust between individuals. Predicting the strength of trust relationships can provide useful information for viral marketing, recommendation engines, and internet security. HL-MRFs with linear potentials have recently been applied by Huang et al. [10] to this task, showing superior results with models based on sociological theory. We reproduce their experimental setup using their sample of the signed Epinions trust network, in which users indicate whether they trust or distrust other users. We perform eight-fold cross-validation. In each fold, the prediction algorithm observes the entire unsigned social network and all but 1/8 of the trust ratings. We measure prediction accuracy on the held-out 1/8. The sampled network contains 2,000 users, with 8,675 signed links. Of these links, 7,974 are positive and only 701 are negative.

We use a model based on the social theory of structural balance, which suggests that social structures are governed by a system that prefers triangles that are considered balanced. Balanced triangles have an odd number of positive trust relationships; thus, considering all possible directions of links that form a triad of users, there are sixteen logical implications of the form

Trusts(A, B) ∧ Trusts(B, C) ⇒ Trusts(A, C).

Huang et al. [10] list all sixteen of these rules, a reciprocity rule, and a prior in their Balance-Recip model, which we omit to save space.

Since we expect some of these structural implications to be more or less accurate, learning weights for these rules provides better models. Again, we use these rules to define HL-MRFs and discrete MRFs, and we train them using various learning algorithms. We compute three metrics: the area under the receiver operating characteristic (ROC) curve, and the areas under the precision-recall curves for positive trust and negative trust. On all three metrics, HL-MRFs with squared potentials score significantly higher. The differences among the learning methods for squared HL-MRFs are insignificant, but the differences among the models are statistically significant for the ROC metric. For area under the precision-recall curve for positive trust, discrete MRFs trained with LME are statistically tied with the best score, and both HL-MRF-L and discrete MRFs trained with LME are statistically tied with the best area under the precision-recall curve for negative trust. The results are listed in Table 2.

Table 2: Average area under ROC and precision-recall curves of social-trust prediction by HL-MRFs and discrete MRFs. Scores statistically equivalent to the best scoring method by metric are typed in bold.

                   ROC     P-R (+)   P-R (-)
HL-MRF-Q (MLE)     0.822   0.978     0.452
HL-MRF-Q (MPLE)    0.832   0.979     0.482
HL-MRF-Q (LME)     0.814   0.976     0.462
HL-MRF-L (MLE)     0.765   0.965     0.357
HL-MRF-L (MPLE)    0.757   0.963     0.333
HL-MRF-L (LME)     0.783   0.967     0.453
MRF (MLE)          0.655   0.942     0.270
MRF (MPLE)         0.725   0.963     0.298
MRF (LME)          0.795   0.973     0.441

Though the random fold splits are not the same, using the same experimental setup, Huang et al. [10] also scored the precision-recall area for negative trust of standard trust prediction algorithms EigenTrust and TidalTrust, which scored 0.131 and 0.130, respectively. The logical models based on structural balance that we run here are significantly more accurate, and HL-MRFs more than discrete MRFs.

Table 3 lists average inference times on five folds of three prediction tasks: Cora, Citeseer, and Epinions. We implemented each method in Java. Both HL-MRF-Q and HL-MRF-L are much faster than discrete MRFs. This illustrates an important difference between performing structured prediction via convex inference versus sampling in a discrete prediction space: using our MPE inference algorithm is much faster.

Table 3: Average inference times (reported in seconds) of single-threaded HL-MRFs and discrete MRFs.

            Citeseer   Cora     Epinions
HL-MRF-Q    0.42       0.70     0.32
HL-MRF-L    0.46       0.50     0.28
MRF         110.96     184.32   212.36

5.3 PREFERENCE PREDICTION

Preference prediction is the task of inferring user attitudes (often quantified by ratings) toward a set of items. This problem is naturally structured, since a user's preferences are often interdependent, as are an item's ratings. Collaborative filtering is the task of predicting unknown ratings using only a subset of observed ratings. Methods for this task range from simple nearest-neighbor classifiers to complex latent factor models. To illustrate the versatility of HL-MRFs, we design a simple, interpretable collaborative filtering model for predicting humor preferences.
We test this model on the Jester dataset, a repository of ratings from 24,983 users on a set of 100 jokes [9]. Each joke is rated on a scale of [−10, +10], which we normalize to [0, 1]. We sample a random 2,000 users from the set of those who rated all 100 jokes, which we then split into 1,000 train and 1,000 test users. From each train and test matrix, we sample a random 50% to use as the observed features X; the remaining ratings are treated as the variables Y.

Our HL-MRF model uses an item-item similarity rule:

SimRate(J1, J2) ∧ Likes(U, J1) ⇒ Likes(U, J2),

where J1, J2 are jokes and U is a user; the predicate Likes indicates the degree of preference (i.e., rating value); and SimRate measures the mean-adjusted cosine similarity between the observed ratings of two jokes. We also include rules to enforce that Likes(U, J) concentrates around the observed average rating of user U and item J, and the global average.

We compare our HL-MRF model to a current state-of-the-art latent factors model, Bayesian probabilistic matrix factorization (BPMF) [20]. BPMF is a fully Bayesian treatment and, as such, is considered "parameter-free"; the only parameter that must be specified is the rank of the decomposition. For our experiments, we use the code of Xiong et al. [26]. Since BPMF does not train a model, we allow BPMF to use all of the training matrix during the prediction phase.

Table 4: Normalized mean squared/absolute errors (NMSE/NMAE) for preference prediction using the Jester dataset. The lowest errors are typed in bold.

                   NMSE     NMAE
HL-MRF-Q (MLE)     0.0554   0.1974
HL-MRF-Q (MPLE)    0.0549   0.1953
HL-MRF-Q (LME)     0.0738   0.2297
HL-MRF-L (MLE)     0.0578   0.2021
HL-MRF-L (MPLE)    0.0535   0.1885
HL-MRF-L (LME)     0.0544   0.1875
BPMF               0.0501   0.1832

Table 4 lists the normalized mean squared error (NMSE) and normalized mean absolute error (NMAE), averaged over 10 random splits. Though BPMF produces the best scores, the improvement over HL-MRF-L (LME) is not significant in NMAE.
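Returning to the similarity predicate above: the paper does not spell out the SimRate formula, but one plausible reading (our sketch, with names of our own) subtracts each user's mean observed rating before taking the cosine over users who rated both jokes:

    import numpy as np

    def sim_rate(R, j1, j2):
        # Mean-adjusted cosine similarity between jokes j1 and j2.
        # R is a users x jokes matrix with np.nan for missing ratings.
        mu = np.nanmean(R, axis=1, keepdims=True)   # per-user mean rating
        D = R - mu                                   # mean-adjusted ratings
        both = ~np.isnan(R[:, j1]) & ~np.isnan(R[:, j2])
        u, v = D[both, j1], D[both, j2]
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    R = np.array([[0.9, 0.8, np.nan],
                  [0.2, 0.1, 0.7],
                  [0.6, np.nan, 0.4]])
    print(sim_rate(R, 0, 1))   # similarity used to template the rule above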
5.4 IMAGE RECONSTRUCTION

Digital image reconstruction requires models that understand how pixels relate to each other, such that when some pixels are unobserved, the model can infer their values from parts of the image that are observed. We construct pixel-grid HL-MRFs for image reconstruction. We test these models using the experimental setup of Poon and Domingos [18]: we reconstruct images from the Olivetti face data set and the Caltech101 face category. The Olivetti data set contains 400 images, 64 pixels wide and tall, and the Caltech101 face category contains 435 examples of faces, which we crop to the center 64 by 64 patch, as was done by Poon and Domingos [18]. Following their experimental setup, we hold out the last fifty images and predict either the left half of the image or the bottom half.

The HL-MRFs in this experiment are much more complex than the ones in our other experiments because we allow each pixel to have its own weight for the following rules, which encode agreement or disagreement between neighboring pixels:

Bright(P_ij, I) ∧ North(P_ij, Q) ⇒ Bright(Q, I),
Bright(P_ij, I) ∧ North(P_ij, Q) ⇒ ¬Bright(Q, I),
¬Bright(P_ij, I) ∧ North(P_ij, Q) ⇒ Bright(Q, I),
¬Bright(P_ij, I) ∧ North(P_ij, Q) ⇒ ¬Bright(Q, I),

where Bright(P_ij, I) is the normalized brightness of pixel P_ij in image I, and North(P_ij, Q) indicates that Q is the north neighbor of P_ij. We similarly include analogous rules for the south, east, and west neighbors, as well as the pixels mirrored across the horizontal and vertical axes. This setup results in up to 24 rules per pixel, which, in a 64 by 64 image, produces 80,896 weighted potential templates.

We train these HL-MRFs using MPE-approximate maximum likelihood with a 5.0 step size on the first 200 images of each data set and test on the last fifty. For training, we maximize the data log-likelihood of uniformly random held-out pixels for each training image, allowing for generalization throughout the image.
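As a small sketch of how the four agree/disagree variants ground for one pixel-neighbor pair (our illustration; with the neighbor relation observed, each rule reduces to a hinge of the body and head brightness literals):

    def neighbor_rule_potentials(p_val, q_val, p=2):
        # The four rules pair Bright(P)/not-Bright(P) bodies with
        # Bright(Q)/not-Bright(Q) heads; each grounds to the hinge
        # [max{body - head, 0}]^p and carries its own learned weight.
        bodies = [p_val, 1.0 - p_val]    # Bright(P), not-Bright(P)
        heads = [q_val, 1.0 - q_val]     # Bright(Q), not-Bright(Q)
        return [max(b - h, 0.0) ** p for b in bodies for h in heads]

    print(neighbor_rule_potentials(0.8, 0.3))  # four separately weighted hinges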
Table 5 lists our results and others reported by Poon and Domingos [18]. HL-MRFs produce the best mean squared error on the left- and bottom-half settings for the Caltech101 set and the left-half setting in the Olivetti set. Only sum product networks produce lower error on the Olivetti bottom-half faces. Some reconstructed faces are displayed in Figure 1, where the shallow, pixel-based HL-MRFs produce comparably convincing images to sum-product networks, especially in the left-half setting, where HL-MRFs can learn which pixels are likely to mimic their horizontal mirror. While neither method is particularly good at reconstructing the bottom half of faces, the qualitative difference between the deep SPN and the shallow HL-MRF reconstructions is that SPNs seem to hallucinate different faces, often with some artifacts, while HL-MRFs predict blurry shapes roughly the same pixel intensity as the observed, top half of the face. The tendency to better match pixel intensity helps HL-MRFs score better quantitatively on the Caltech101 faces, where the lighting conditions are more varied.

Table 5: Mean squared errors per pixel for image reconstruction. HL-MRFs produce the most accurate reconstructions on the Caltech101 and the left-half Olivetti faces, and only sum-product networks produce better reconstructions on Olivetti bottom-half faces. Scores for other methods are taken from Poon and Domingos [18].

                  HL-MRF-Q (MLE)   SPN    DBM    DBN    PCA    NN
Caltech-Left      1741             1815   2998   4960   2851   2327
Caltech-Bottom    1910             1924   2656   3447   1944   2575
Olivetti-Left     927              942    1866   2386   1076   1527
Olivetti-Bottom   1226             918    2401   1931   1265   1793

Figure 1: Example results on image reconstruction of Caltech101 (left) and Olivetti (right) faces. From left to right in each column: (1) true face, left side predictions by (2) HL-MRFs and (3) SPNs, and bottom half predictions by (4) HL-MRFs and (5) SPNs. SPN reconstructions are downloaded from Poon and Domingos [18].

Training and predicting with these HL-MRFs takes little time. In our experiments, training each model takes about 45 minutes on a 12-core machine, while predicting takes under a second per image. While Poon and Domingos [18] report faster training with SPNs, both HL-MRFs and SPNs clearly belong to a class of faster models when compared to DBNs and DBMs, which can take days to train on modern hardware.

6 CONCLUSION

We have shown that HL-MRFs are a flexible and interpretable class of models, capable of modeling a wide variety of domains. HL-MRFs admit fast, convex inference. The MPE inference algorithm we introduce is applicable to the full class of HL-MRFs. With this fast, general algorithm, we are the first to show results using quadratic HL-MRFs on real-world data. In our experiments, HL-MRFs match or exceed the predictive performance of state-of-the-art methods on four diverse tasks. The natural mapping between hinge-loss potentials and logic rules makes HL-MRFs easy to define and interpret.

Acknowledgements

This work was supported by NSF grants CCF0937094 and IIS1218488, and IARPA via DoI/NBC contract number D12PC00337. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
References

[1] L. An and P. Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133:23–46, 2005.

[2] F. Bach and M. Jordan. Thin junction trees. In Neural Information Processing Systems, 2001.

[3] S. Bach, M. Broecheler, L. Getoor, and D. O'Leary. Scaling MPE inference for constrained continuous Markov random fields with consensus optimization. In Neural Information Processing Systems, 2012.

[4] J. Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, 24(3):179–195, 1975.

[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 2011.

[6] M. Broecheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In Uncertainty in Artificial Intelligence, 2010.

[7] M. Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing, 2002.

[8] P. Domingos and W. Webb. A tractable first-order probabilistic logic. In AAAI Conference on Artificial Intelligence, 2012.

[9] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, July 2001.

[10] B. Huang, A. Kimmig, L. Getoor, and J. Golbeck. A flexible framework for probabilistic models of social trust. In Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction, 2013.

[11] T. Huynh and R. Mooney. Online max-margin weight learning for Markov logic networks. In SIAM International Conference on Data Mining, 2011.

[12] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

[13] A. Kimmig, S. Bach, M. Broecheler, B. Huang, and L. Getoor. A short introduction to probabilistic soft logic. In NIPS Workshop on Probabilistic Programming: Foundations and Applications, 2012.

[14] D. Lowd and P. Domingos. Efficient weight learning for Markov logic networks. In Principles and Practice of Knowledge Discovery in Databases, 2007.

[15] A. Martins, M. Figueiredo, P. Aguiar, N. Smith, and E. Xing. An augmented Lagrangian approach to constrained MAP inference. In International Conference on Machine Learning, 2011.

[16] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In European Conference on Machine Learning and Knowledge Discovery in Databases, 2011.

[17] J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning Research, 8:653–692, 2007.

[18] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Uncertainty in Artificial Intelligence, 2011.

[19] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.

[20] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008.

[21] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.

[22] D. Sontag and T. Jaakkola. Tree block coordinate descent for MAP in graphical models. In Artificial Intelligence and Statistics, 2009.

[23] B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems, 2003.

[24] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Neural Information Processing Systems, 2004.

[25] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), January 2008.

[26] L. Xiong, X. Chen, T. Huang, J. Schneider, and J. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SIAM International Conference on Data Mining, 2010.