From Statistical Relational to Neural Symbolic Artificial Intelligence: a Survey

arXiv:2108.11451v1 [cs.AI] 25 Aug 2021

Giuseppe Marra¹, Sebastijan Dumančić¹, Robin Manhaeve¹, and Luc De Raedt¹,²
firstname.lastname@kuleuven.be
¹ KU Leuven, Department of Computer Science and Leuven.AI
² Örebro University, Center for Applied Autonomous Sensor Systems

August 27, 2021

Abstract

Neural-symbolic and statistical relational artificial intelligence both integrate frameworks for learning with logical reasoning. This survey identifies several parallels across seven different dimensions between these two fields. These can not only be used to characterize and position neural-symbolic artificial intelligence approaches but also to identify a number of directions for further research.

1 Introduction

The integration of learning and reasoning is one of the key challenges in artificial intelligence and machine learning today, and various communities have been addressing it. That is especially true for the field of neural-symbolic computation (NeSy) [10, 21], where the goal is to integrate symbolic reasoning and neural networks. NeSy already has a long tradition, and it has recently attracted a lot of attention from various communities (cf. the keynotes of Y. Bengio and H. Kautz on this topic at AAAI 2020, and the AI Debate [9] between Y. Bengio and G. Marcus).

Another domain that has a rich tradition in integrating learning and reasoning is that of statistical relational learning and artificial intelligence (StarAI) [39, 85]. But rather than focusing on integrating logic and neural networks, it is centred around the question of integrating logic with probabilistic reasoning, more specifically probabilistic graphical models. Despite the common interest in combining symbolic reasoning with a basic paradigm for learning, i.e., probabilistic graphical models or neural networks, it is surprising that there are not more interactions between these two fields.
This discrepancy is the key motivation behind this survey: it aims to point out the similarities between these two endeavours and, in this way, to stimulate cross-fertilization. In doing so, we start from the literature on StarAI, following the key concepts and techniques outlined in several textbooks and tutorials such as [92, 85], because it turns out that the same issues and techniques that arise in StarAI apply to NeSy as well.

As the key contribution of this survey, we identify seven dimensions that these fields have in common and that can be used to categorize both StarAI and NeSy approaches. These seven dimensions are concerned with (1) type of logic, (2) model vs proof-based inference, (3) directed vs undirected models, (4) logical semantics, (5) learning parameters or structure, (6) representing entities as symbols or sub-symbols, and (7) integrating logic with probability and/or neural computation. We provide evidence for our claim by positioning a wide variety of StarAI and NeSy systems along these dimensions and pointing out analogies between them. This provides not only new insights into the relationships between StarAI and NeSy, but it also allows one to carry over and adapt techniques from one field to another. Thus the insights provided in this paper can be used to create new opportunities for cross-fertilization between StarAI and NeSy, by focusing on those dimensions that have not been fully exploited yet.

The classification of numerous methods within the same categories sometimes comes at the cost of oversimplification. Thus, the individual dimensions are accompanied by examples of specific methods. For each example, a final discussion frames the specific technique inside the dimension. With this approach, we present a very broad overview of the research field while still providing specific intuitions on how the different features are implemented.

Unlike some other perspectives on neural-symbolic computation [10, 21], the present survey limits itself to a logical and probabilistic perspective, which it inherits from StarAI, and to developments in neural-symbolic computation that are consistent with this perspective. Furthermore, it focuses on representative and prototypical systems rather than aiming at completeness (which would not be possible given the fast developments in the field).

Several other surveys about neural-symbolic AI have been proposed. An early overview of neural-symbolic computation is that of [4]. Unlike the present survey, it focuses very much on a logical and a reasoning perspective; today, the focus has shifted very much to learning. More recently, [59] analysed the intersection between NeSy and graph neural networks (GNNs). [105] described neural-symbolic systems in terms of the composition of blocks described by a few patterns concerning processes and exchanged data; in contrast, this survey is more focused on the underlying principles that govern such a composition. [20] instead takes a neural network viewpoint by investigating into which components (i.e. input, loss or structure) symbolic knowledge is injected.

The following sections of the paper each describe one dimension. We summarize various neural-symbolic approaches along these dimensions in Table 1. For ease of writing, we do not always repeat the references to these approaches in the paper; the table mentions the key reference for each of them.
The abbreviations used in Table 1 are: Dimension 1: (P)ropositional, (R)elational, (FOL) first-order logic, (LP) logic programs. Dimension 2: (M)odel-based, (P)roof-based. Dimension 3: (D)irected, (U)ndirected. Dimension 4: (L)ogic, (P)robability, (F)uzzy. Dimension 5: (P)arameter, (S)tructure. Dimension 6: (S)ymbols, (Sub)symbols. Dimension 7: Logic (L/l), Probability (P/p), Neural (N/n).

                     Dim 1   Dim 2   Dim 3   Dim 4   Dim 5   Dim 6    Dim 7
∂ILP [31]            R       P       D       F       P+S     S        Ln
DeepProbLog [64]     LP      P       D       P       P       S+Sub    LpN
DiffLog [99]         R       P       D       F       P+S     S        Ln
LRNN [125]           LP      P       D       F       P+S     S+Sub    Ln
LTN [26]             FOL     M       U       F       P       Sub      lN
NeuralLP [118]       R       M       D       L       P       S        Ln
NeurASP [119]        LP      P       D       P       P       S        LpN
NGS [60]             LP      P       D       L       P       S        Ln
NLM [27]             R       M       D       L       P+S     S        Ln
NLog [103]           LP      P       D       L       P       S        Ln
NLProlog [112]       LP      P       D       P       P+S     S+Sub    LpN
NMLN [69]            FOL     M       U       P       P+S     S+Sub    lPN
NTP [90]             R       P       D       L       P+S     S+Sub    Ln
RNM [67]             FOL     M       U       P       P       S+Sub    lPN
SL [114]             P       M       U       P       P       S+Sub    lPN
SBR [25]             FOL     M       U       F       P       Sub      lN
Tensorlog [13]       R       P       D       P       P       S+Sub    Ln

Table 1: Taxonomy of a (non-exhaustive) list of NeSy models according to the 7 dimensions outlined in the paper.
2 Logic

Let us start by providing an introduction to clausal logic. We focus on clausal logic as it is a standard form to which any first order logical formula can be converted. In clausal logic, everything is represented in terms of clauses. More formally, a clause is an expression of the form h1 ∨ ... ∨ hk ← b1 ∧ ... ∧ bn. The hi are head literals or conclusions, while the bi are body literals or conditions. Clauses with no conditions (n = 0) and one conclusion (k = 1) are facts. Clauses with only one conclusion (k = 1) are definite clauses. Definite clauses are the basic constructs used in the programming language Prolog.

Example 1 (Propositional clausal logic). Consider the famous alarm problem expressed as a set of definite clauses.

burglary.
hears_alarm_mary.
earthquake.
hears_alarm_john.
alarm ← earthquake.
alarm ← burglary.
calls_mary ← alarm, hears_alarm_mary.
calls_john ← alarm, hears_alarm_john.

In the above example, the literals did not have any internal structure: we were working in propositional logic. This contrasts with first-order logic, in which the literals take the form p(t1, ..., tm), with p a predicate of arity m and the ti terms, that is, constants, variables, or structured terms of the form f(t1, ..., tq), where f is a functor and the ti are again terms. Intuitively, constants represent objects or entities, functors represent functions, variables make abstraction of specific objects, and predicates specify relationships amongst objects. The subset of first order logic where there are no functors is called relational logic.

Example 2 (Clausal logic). In contrast to the previous example, we now write the theory in a more compact manner using first order logic. By convention, constants start with a lowercase letter, while variables start with an uppercase. Essential is the use of the variable X in the rule for the calls predicate, which is implicitly universally quantified, and which states that X will call when the alarm goes off and X hears_alarm.

burglary.
hears_alarm(mary).
earthquake.
hears_alarm(john).
alarm ← earthquake.
alarm ← burglary.
calls(X) ← alarm, hears_alarm(X).

Let us also introduce some basic concepts that will be useful for the rest of the paper. When an expression (i.e., clause, atom or term) does not contain any variable, it is called ground. A substitution θ is an expression of the form {X1/c1, ..., Xk/ck}, with the Xi different variables and the ci terms. Applying a substitution θ to an expression e (term, atom or clause) yields the instantiated expression eθ, where all variables Xi in e have been simultaneously replaced by their corresponding terms ci in θ. We can take for instance the atom calls(X) and apply the substitution {X/mary} to yield calls(mary).

Propositional logic is a subset of relational logic, which is itself a subset of first order logic. Therefore, first order logic is more expressive than relational logic, which is itself more expressive than propositional logic. Propositional and first-order logic form the two extremes on the spectrum of logical reasoning frameworks and are essential for understanding the capabilities of StarAI and NeSy systems. Propositional logic is the simplest and, consequently, the least expressive formalism. However, due to the mentioned restrictions, inference for propositional logic is decidable, whereas it is only semi-decidable for first order logic. The major weakness of the propositional restriction is that specifying complex knowledge can be tedious and requires substantial effort. The strengths and weaknesses of first-order logic are complementary: due to its expressiveness, complex problems are easy to specify, but this comes with a computational price. Relational logic is somewhat in the middle and is more in line with a database perspective. Interestingly, any problem expressed in first-order logic can be equivalently expressed in relational logic; any problem expressed in relational logic can likewise be expressed in propositional logic by grounding out the clauses [83, 34]. Grounding is the process whereby all possible substitutions that ground the clause are applied. Notice that grounding a first order logical theory may result in an infinite set of ground clauses (when there are functors) or in an exponentially larger set of clauses (when working with finite domains). The rules of chess can fit a single page if written in first-order logic, while they take several hundred pages if grounded out in propositional logic.

StarAI and NeSy along Dimension 1

Understanding which type of logic a StarAI or NeSy system is built on is important for assessing the capabilities of that system. StarAI approaches typically focus on the most expressive logics, such as logic programming [22, 94] and first-order logic [88]. For NeSy, systems based on propositional logic, like Semantic Loss (SL) [114], can only do the simplest logical reasoning, but they can often do it efficiently. Datalog and relational
logic-based systems are well-suited for problems that require database queries. Datalog systems are the most predominant ones in NeSy, like DiffLog [99], ∂ILP [31], Lifted Relational Neural Networks [125] and Neural Theorem Provers [90]. Systems leveraging answer-set programming (see below), like NeurASP [119], are suited not only for database queries but also for common-sense reasoning and reasoning with preferences. Systems based on logic programming and Prolog, like DeepProbLog [64], NLog [103] and NLProlog [112], are suited for tasks that require a full-fledged programming language for, e.g., data structure or state manipulations. Grammars, like CFGs [50] or unification-based grammars [98], have often been targeted in the logic programming community, cf. Definite Clause Grammars [34]. The close relationship between the two formalisms has given rise to grammar-based neural-symbolic systems, like NGS [60] and DeepStochLog [113], which are very close to logic programming based systems. Finally, some systems are not restricted to definite clauses and allow general first-order-logic theories, like Logic Tensor Networks [26], Semantic Based Regularization [25], Relational Neural Machines [67] and Logical Neural Networks [89].

Dimension 1: Propositional, Relational, First Order Logic

Propositional logic is a subset of relational logic, which is itself a subset of first order logic. Therefore, first order logic is more expressive than relational logic, which is itself more expressive than propositional logic. Propositional and first-order logic form the two extremes on the spectrum of logical reasoning frameworks and are essential for understanding the capabilities of StarAI and NeSy systems. Propositional logic is the simplest and, consequently, the least expressive formalism, but inference for propositional logic is decidable. First order logic is a more expressive and compact formalism, but inference is only semi-decidable. Relational logic is somewhat in the middle and is more in line with a database perspective.

3 About proofs and models, and rules and constraints

So far we have introduced the syntax of clausal logic but have neither discussed semantics nor inference. The semantics of different approaches is usually defined in terms of models. For inference, one is usually interested in finding proofs for certain logical queries, or one wants to find assignments to certain variables that satisfy a given theory.

In the setting of logic programming, definite clauses (rules) are interpreted as computational rules (compute h by computing b1, ..., and bn) and are typically used for forward or backward inference to prove that certain atoms hold. Inference typically proceeds by searching for proofs for queries, as illustrated in
Example 3. This gives rise to a proof-theoretic perspective on logic. Although proofs and proof-trees in Prolog are built using SLD- or SLDNF-resolution, we depict the proofs as an AND-OR tree for ease of exposition.

Example 3 (Logic programs and proofs). Consider the following logic program:

burglary.
hears_alarm_mary.
earthquake.
hears_alarm_john.
alarm :- earthquake.
alarm :- burglary.
calls_mary :- alarm, hears_alarm_mary.
calls_john :- alarm, hears_alarm_john.

and the proofs for the query calls_mary as an AND/OR tree:

calls_mary
└─ AND: alarm, hears_alarm_mary
   └─ alarm
      └─ OR: burglary, earthquake

In logic programs, we use Prolog's ":-" instead of ←, to differentiate its semantics from first order logic implications. The rules for alarm state that there will be an alarm if there is a burglary or an earthquake.

On the other hand, we have the model theoretic perspective on logic, which relies on the notions of interpretations and models. In this paper, we restrict ourselves to Herbrand models and Herbrand interpretations. The Herbrand base of a set of clauses is the set of ground atoms that can be constructed using the predicates, functors and constants occurring in the theory.

Definition 1 (Interpretation and possible world). A Herbrand interpretation, or a possible world, is a set of truth assignments {a1 = v1, ..., an = vn}, where a1, ..., an are all the ground atoms in the Herbrand base and vi are the corresponding assigned truth values.

Equivalently, one can define an interpretation as a subset of the Herbrand
base containing only the true atoms, while all the others are false. A Herbrand interpretation is a model of a clause h1 ∨ ... ∨ hk ← b1 ∧ ... ∧ bn if for every substitution θ such that b1θ ∧ ... ∧ bnθ is true in the interpretation, at least one of the hiθ is true in the interpretation as well. An interpretation I is a model of a theory T, written I |= T, if it is a model of all clauses in the theory; a theory that has a model is called satisfiable. The satisfiability problem, that is, deciding whether a theory has a model, is one of the most fundamental ones in computer science (cf. the SAT problem for propositional logic). In contrast to proof-based techniques, model-theoretic ones use logic as a set of constraints on the variables, stating how the variables are related to one another without imposing any directed relationship between them. More details on these connections can be found in [85, 34].

Example 4 (Model-theoretic). Consider the theory composed of the following clauses:

calls_mary ← hears_alarm_mary ∧ alarm
calls_john ← hears_alarm_john ∧ alarm
alarm ← burglary
alarm ← earthquake

A model of the previous theory is the set:

M = {burglary, hears_alarm_john, alarm, calls_john}

By considering all the elements of this set True and all the others False, the four clauses are satisfied.

The model theoretic semantics of clausal logic differs from that of logic programs consisting of definite clauses. The model-theoretic semantics of a clausal theory corresponds to the set of all Herbrand models, while for definite clause logic programs it is given by the smallest Herbrand model with respect to set inclusion, the so-called least Herbrand model (LHM), which is unique. We say that a logic program T entails an atom e, denoted T |= e, if and only if e is true in the least Herbrand model of T. This corresponds to making the closed world assumption: every statement that cannot be proven is assumed to be false. The least Herbrand semantics allows using definite clauses as a programming language, and they form the basis for "pure" Prolog. It naturally supports data structures and allows computing, for instance, transitive closures, which is impossible in standard first order logic. This use of a least Herbrand model is important because there are models of a definite clause theory that are not minimal, as shown in Example 5.

Example 5 (Model-based vs logic program semantics). Let us consider the following set of clauses:
edge(1,2) ← True
path(A,B) ← edge(A,B)
path(A,B) ← edge(A,C) ∧ path(C,B)

If we consider them as clauses of a logic program, then the unique least Herbrand model (LHM) is:

M_LHM = {edge(1,2), path(1,2)}

On the other hand, the model-based semantics allows for all the models of the theory:

M1 = M_LHM = {edge(1,2), path(1,2)}
M2 = {edge(1,2), path(1,2), path(1,1)}
M3 = {edge(1,2), path(1,2), path(2,1)}
...

These differences are also important for StarAI and NeSy. Indeed, StarAI and NeSy systems based on first-order logic, such as Markov logic networks [88] and Probabilistic Soft Logic [3], cannot model transitive closure, which can lead to unintuitive inference results. They view, as we shall show in Example 14, logical formulae as (soft) constraints. In contrast, systems based on logic programming, such as ProbLog [22], have no difficulties with transitive closure. They use the clauses as inference rules to build proofs and derivations.

It is worth noting that there are various flavours of logic programming. Datalog is the relational subset of definite clause logic; it is strongly related to database languages such as SQL. Furthermore, because it prohibits the use of structured terms, it guarantees termination. Answer-set programming [37] is a popular logic programming framework that is not restricted to definite clauses and that takes the constraint perspective. Answer-set programs can have multiple models and support features such as soft and hard constraints and preferences. For a detailed introduction to answer-set programs, we refer to [37].

The difference between the logic programming perspective and that of full clausal logic can thus be related to the difference between a proof theoretic and a model theoretic perspective. In the model theoretic perspective, we view the clauses as constraints that need to be satisfied, while in the proof theoretic perspective, we view them as rules to answer particular queries. This is clear when looking at propositional logic: propositional definite clauses can be viewed as simple IF ... THEN rules that can be chained in the forward or the backward direction in order to derive new conclusions (general clauses can also be used in proofs), while propositional clauses in a SAT theory are disjunctive constraints that need to be satisfied.

StarAI along Dimension 2

Many StarAI systems use logic as a kind of template to ground out the relational model in order to obtain a grounded
model and perform inference. This is akin to the model-based perspective of logic. This grounded model can be a graphical model or, alternatively, a ground weighted logical theory on which traditional inference methods apply, such as belief propagation or weighted model counting. This is used in well-known systems such as Markov Logic Networks (MLNs) [88], Probabilistic Soft Logic (PSL) [3], Bayesian logic programs (BLPs) [54] and probabilistic relational models (PRMs) [36]. Some systems like PRMs and BLPs additionally use aggregates, or combining rules, in order to combine multiple conditional probability distributions into one using, e.g., noisy-or.

Alternatively, one can follow a proof or trace based approach to define the probability distribution and perform inference. This is akin to what happens in probabilistic programming (cf. also [92]), in StarAI frameworks such as probabilistic logic programs (PLPs) [86], probabilistic databases [106] and probabilistic unification based grammars such as Stochastic Logic Programs (SLPs) [74]. Just like pure logic supports the model-theoretic and proof-theoretic perspectives, both perspectives have been explored in parallel for some of the probabilistic logic programming languages such as ICL [81] and ProbLog [32].

NeSy along Dimension 2

These two perspectives carry over to the neural-symbolic methods. Approaches like LRNN, LNN, NTPs, DeepProbLog, ∂ILP, DiffLog, NeuralLP and Neural Logic Machines (NLM) [27] are proof-based. The probabilities or certainties that these systems output are based on the enumerated proofs, and they can also learn how to combine them. In contrast, approaches such as NeurASP, Logic Tensor Networks (LTNs) [26], Semantic Based Regularization (SBR) [25], SL, Relational Neural Machines (RNM) [67] and Neural Markov Logic Networks (NMLN) [69] are all based on the model-theoretic perspective. Learning in these models is done by learning the (shared) parameters over the ground model, and inference is based on possible groundings of the model.

Dimension 2: Rules or Constraints

In the model theoretic perspective, we view the clauses as constraints that need to be satisfied, while in the proof theoretic perspective, we view them as rules to answer particular queries.

4 Probabilistic graphical models

Probabilistic graphical models [58] are graphical models that compactly represent a (joint) probability distribution P(X1, ..., Xn) over n discrete or continuous random variables X1, ..., Xn. The key idea is that the joint factorizes over some factors f_i specified over subsets X^i of the variables {X1, ..., Xn}:

P(X1, ..., Xn) = (1/Z) · f1(X^1) × ... × fk(X^k)
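Before turning to the directed and undirected variants, the following Python sketch makes the factorization concrete. It is only an illustration of the formula above: the variable names and factor tables are made up for this survey's running alarm example and do not come from any of the surveyed systems.

```python
# A minimal, hypothetical sketch of the factorized joint
#   P(X1, ..., Xn) = (1/Z) f1(X^1) x ... x fk(X^k)
# over three Boolean variables; the factor tables are invented for illustration.
from itertools import product

variables = ["burglary", "earthquake", "alarm"]

# Each factor is defined over a subset of the variables and returns a
# non-negative number for every assignment of that subset.
factors = [
    (("burglary",),   lambda b: 0.1 if b else 0.9),
    (("earthquake",), lambda e: 0.05 if e else 0.95),
    (("burglary", "earthquake", "alarm"),
     lambda b, e, a: 5.0 if a == (b or e) else 1.0),  # favours alarm <-> burglary or earthquake
]

def unnormalized(world):
    """Product of all factors evaluated on a full assignment (a possible world)."""
    score = 1.0
    for scope, f in factors:
        score *= f(*(world[v] for v in scope))
    return score

# The partition function Z sums the unnormalized score over all 2^n worlds.
worlds = [dict(zip(variables, values)) for values in product([False, True], repeat=len(variables))]
Z = sum(unnormalized(w) for w in worlds)

for w in worlds:
    print(w, round(unnormalized(w) / Z, 4))
```

Note that computing Z this way enumerates all 2^n possible worlds; the inference techniques discussed later (e.g. knowledge compilation) exist precisely to avoid this blow-up whenever possible.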
The random variables correspond to the nodes in the graphical structure, and the factorization is determined by the edges in the graph. There are two classes of graphical models: directed, or Bayesian networks, and undirected, or Markov networks. In Bayesian networks, the underlying graph structure is a directed acyclic graph, and the factors f_i(Xi | parents(Xi)) correspond to the conditional probabilities P(Xi | parents(Xi)), where parents(Xi) denotes the set of random variables that are a parent of Xi in the graph. In Markov networks, the graph is undirected and the factors f_i(X^i) correspond to the sets of nodes X^i that form (maximal) cliques in the graph. Furthermore, the factors are non-negative and Z is a normalisation constant.

4.1 StarAI along Dimension 3

The distinction between directed and undirected graphical models [58] has led to two distinct types of StarAI systems [85].

The first type of StarAI systems generalizes directed models and resembles Bayesian networks. It includes well-known representations such as plate notation [58], probabilistic relational models (PRMs) [36], probabilistic logic programs (PLPs) [86], and Bayesian logic programs (BLPs) [54]. Today the most typical and popular representatives of this category are the probabilistic (logic) programs. Probabilistic logic programs were introduced by Poole [80], and the first learning algorithm was given by Sato [93]. Probabilistic logic programs are essentially definite clause programs where every fact is annotated with the probability that it is true. This then results in a possible world semantics. The reason why probabilistic logic programs are viewed as directed models is clear when looking at the derivations for a query, cf. Example 3. At the top of the AND-OR tree, there is the query that one wants to prove, and the structure of the tree is that of a directed graph (even though it need not be acyclic). One can also straightforwardly map directed graphical models, i.e., Bayesian networks, onto such probabilistic logic programs by associating one definite clause with every entry in a conditional probability table, i.e., a factor of the form P(X | Y1, ..., Yn). Assuming Boolean random variables, each entry x, y1, ..., yn with parameter value v can be represented using the definite clause X(x) ← Y1(y1) ∧ ... ∧ Yn(yn) ∧ p_{x,y1,...,yn} and the probabilistic fact v :: p_{x,y1,...,yn}. A probabilistic version of Example 3 is shown in Example 6 using the syntax of ProbLog [22].

Example 6 (ProbLog). We show a probabilistic extension for the alarm program using ProbLog notation.

0.1::burglary.
0.3::hears_alarm(mary).
0.05::earthquake.
0.6::hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(X) :- alarm, hears_alarm(X).

This program can be mapped to the Bayesian network in Figure 1.

Figure 1: The Bayesian network corresponding to the ProbLog program in Example 6, with nodes burglary, earthquake, hears_alarm(john), hears_alarm(mary), alarm, calls(john) and calls(mary).

This probabilistic logic program defines a distribution p over possible worlds ω. Let P be a ProbLog program and F = {p1 : c1, ..., pn : cn} be the set of ground probabilistic facts ci of the program, with pi their corresponding probabilities. ProbLog defines the probability of a possible world ω in the following way:

p(ω) = ∏_{ci true in ω} pi · ∏_{cj false in ω} (1 − pj),  if ω |= P
p(ω) = 0,  otherwise (ω ⊭ P)

The second type of StarAI systems generalizes undirected graphical models like Markov networks or random fields. The prototypical example is Markov Logic Networks (MLNs) [88], and Probabilistic Soft Logic (PSL) [3] also follows this idea. Undirected StarAI methods define a set of weighted clauses w : h1 ∨ ... ∨ hk ← b1 ∧ ... ∧ bm, and a domain D. The idea is that once the clauses are grounded over the domain D, they become soft constraints: the higher the weight of a ground clause, the less likely are the possible worlds that violate these constraints. In the limit, when the weight is +∞, the constraint must be satisfied and becomes a pure logical constraint. The weighted clauses specify a more general relationship between the conclusion and the condition than the definite clauses of directed models. While clauses of undirected models can still be used in (resolution) theorem provers, they are usually viewed as constraints that relate the two sets of atoms, as is common in Answer Set Programming [38]. Undirected models can be mapped to an undirected probabilistic graphical model in which there is a one-to-one correspondence between grounded weighted clauses and factors, as we show in Example 7.
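As a preview of Example 7, the following Python sketch (a hypothetical illustration with helper names of our own, not the implementation of any MLN system) grounds the weighted clauses over the domain {john, mary} and computes the unnormalized score exp(Σ_i βi Σ_j 1(ω, αiθj)) of one possible world.

```python
# Hypothetical sketch: grounding weighted clauses over a finite domain and
# scoring a possible world the way an MLN-style model would (cf. Example 7).
from itertools import product
from math import exp

domain = ["john", "mary"]

# Weighted clauses: (weight, list of body atoms, head atom), with X a variable.
weighted_clauses = [
    (1.5, ["hears_alarm(X)", "alarm"], "calls(X)"),
    (2.0, ["burglary"], "alarm"),
    (2.0, ["earthquake"], "alarm"),
]

def ground(atom, c):
    return atom.replace("X", c)

def groundings(clause):
    """All ground instances of a clause over the domain."""
    weight, body, head = clause
    if "X" not in head and all("X" not in b for b in body):
        return [(weight, body, head)]
    return [(weight, [ground(b, c) for b in body], ground(head, c)) for c in domain]

def indicator(world, body, head):
    """1 if the ground clause head <- body is true in the world, 0 otherwise."""
    body_true = all(world.get(b, False) for b in body)
    return 1 if (not body_true or world.get(head, False)) else 0

def unnormalized_score(world):
    total = 0.0
    for clause in weighted_clauses:
        for weight, body, head in groundings(clause):
            total += weight * indicator(world, body, head)
    return exp(total)

world = {"burglary": True, "alarm": True, "hears_alarm(john)": True, "calls(john)": True}
print(unnormalized_score(world))  # higher scores for worlds satisfying more ground clauses
```

Normalizing these scores over all possible worlds yields the distribution defined formally in Example 7.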
Example 7 (Markov Logic Networks). We show a probabilistic extension of the theory in Example 4 using the formalism of Markov Logic Networks. We use a first order language with domain D = {john, mary} and weighted clauses α1, α2 and α3, i.e.:

α1 : 1.5 :: calls(X) ← hears_alarm(X) ∧ alarm
α2 : 2 :: alarm ← burglary
α3 : 2 :: alarm ← earthquake

In Figure 2, we show the corresponding Markov field.

Figure 2: The Markov field corresponding to the Markov logic network in Example 7, over the nodes burglary, earthquake, hears_alarm(john), hears_alarm(mary), alarm, calls(john) and calls(mary).

A Markov Logic Network defines a probability distribution over possible worlds as follows. Let A = [α1, ..., αn] be a set of logical clauses and let B = [β1, ..., βn] be the corresponding positive weights. Let θj be a grounding substitution for the clause αi over the domain D of interest and αiθj the corresponding grounded clause. Finally, let 1(ω, αiθj) be an indicator function evaluating to 1 if the ground clause is true in ω and 0 otherwise. The probabilistic semantics of Markov Logic is the distribution

p(ω) = (1/Z) exp( Σ_i βi Σ_j 1(ω, αiθj) )

Intuitively, in MLNs, a world is more probable if it makes many ground clauses true.

4.2 NeSy along Dimension 3

Many neural symbolic systems retain the directed nature of logical inference and can be classified as directed models. The most prominent members of this
category are NeSy systems based on Prolog or Datalog, such as Neural Theorem Provers (NTPs) [90], NLProlog [112], DeepProbLog [64] and DiffLog [99]. Lifted Relational Neural Networks (LRNNs) [125] and ∂ILP [31] are other examples of non-probabilistic directed models, where weighted definite clauses are compiled into a neural network architecture in a forward chaining fashion. The systems that imitate logical reasoning with tensor calculus, Neural Logic Programming (NeuralLP) [118] and Neural Logic Machines (NLM) [27], are likewise instances of directed logic. An example of a directed NeSy model is given in Example 8.

Example 8 (Knowledge-Based Artificial Neural Networks). Knowledge-Based Artificial Neural Networks (KBANN) is one of the first methods to use definite clausal logic to template a neural network. It incorporates many of the common patterns of directed NeSy models. KBANN turns a program into a neural network in several steps:

1. KBANN starts from a definite clause program.
2. The program is turned into an AND-OR tree.
3. The AND-OR tree is turned into a neural network with a similar structure. Nodes are divided into layers. The weights and the biases are set such that evaluating the network returns the same outcome as querying the program.
4. New hidden units are added. Hidden units play the role of unknown rules that need to be learned. They are initialized with zero weights, i.e. they are inactive.
5. New links are added from each layer to the upper one, obtaining the final neural network.

An example of this process is shown in Figure 3. KBANN needs some restrictions on the kind of rules: in particular, the rules are assumed to be conjunctive, non-recursive, and variable-free. Many of these restrictions are removed by more recent systems.

Unlike the directed class, the undirected NeSy approaches do not exploit clauses to perform logical reasoning (e.g. using resolution) but consider logic as a constraint on the behaviour of a neural model. Rules are then used as an objective function for a neural model rather than as a template for a neural architecture. The undirected use of the rules thus has a very large impact on how the symbolic part is exploited compared to the directed methods.

A large group of approaches, including Semantic Based Regularization (SBR) [25], Logic Tensor Networks (LTN) [26], Semantic Loss (SL) [114] and DL2 [33], exploits logical knowledge as a soft regularization constraint over the hypothesis space in a way that favours solutions consistent with the encoded knowledge. SBR and LTN compute atom truth assignments as the output of a neural network and translate the provided logical formulas into a real valued regularization loss
term using fuzzy logic. SL uses marginal probabilities of the target atoms to define the regularization term and relies on arithmetic circuits [18] to evaluate it efficiently, as detailed in Example 9. DL2 defines a numerical loss with no specific (probabilistic or fuzzy) semantics, which allows including numerical variables in the formulas (e.g. by using a logical term x > 1.5). Another group of approaches, including Neural Markov Logic Networks (NMLN) [69] and Relational Neural Machines (RNM) [67], extends MLNs, allowing factors of exponential distributions to be implemented as neural architectures. Finally, [91, 24] compute ground atom scores as dot products between relation and entity embeddings; implication rules are then translated into a logical loss through a continuous relaxation of the implication operator.

Figure 3: Knowledge-Based Artificial Neural Network creation process: (1) the initial logic program (alarm :- earthquake. alarm :- burglary. calls_mary :- alarm, hears_alarm_mary.); (2) the AND-OR tree for the query calls_mary; (3) mapping the tree into a neural network; (4) adding hidden neurons; (5) adding interlayer connections.

Example 9 (Semantic Loss). The Semantic Loss [114] is an example of an undirected model where (probabilistic) logic is exploited as a regularization term in training a neural model.
Let p = [p1, ..., pn] be a vector of probabilities for a list of propositional variables X = [X1, ..., Xn]. In particular, pi denotes the probability of variable Xi being True and corresponds to a single output of a neural net having n outputs. Let α be a logic sentence defined over X. Then, the semantic loss between α and p is:

Loss(α, p) ∝ − log Σ_{x |= α} ∏_{i : x |= Xi} pi · ∏_{i : x |= ¬Xi} (1 − pi)

The authors provide the intuition behind this loss: the semantic loss is proportional to the negative logarithm of the probability of generating a state that satisfies the constraint when sampling values according to p.

Suppose you want to solve a multi-class classification task, where each input example must be assigned to a single class. Then, one would like to enforce mutual exclusivity among the classes. This can easily be done on supervised examples, by coupling a softmax activation layer with a cross entropy loss. However, there is no standard way of imposing this constraint on unlabeled data, which can be useful in a semi-supervised setting. The solution provided by the Semantic Loss framework is to encode mutual exclusivity into the propositional constraint β:

β = (X1 ∧ ¬X2 ∧ ¬X3) ∨ (¬X1 ∧ X2 ∧ ¬X3) ∨ (¬X1 ∧ ¬X2 ∧ X3)

Consider a neural network classifier with three outputs p = [p1, p2, p3]. Then, for each input example (labeled or unlabeled), we can build the semantic loss term:

L(β, p) = p1(1 − p2)(1 − p3) + (1 − p1)p2(1 − p3) + (1 − p1)(1 − p2)p3

which can be added to the standard cross entropy term for the labeled examples.

It is worth comparing this method with KBANN (see Example 8). Here, the logic is turned into a loss function that is used during training. The function constrains the underlying probabilities, but there are no directed or causal relationships among them. Moreover, during evaluation, the probabilities p of the variables are just the outputs of the neural network. On the contrary, in KBANN, the logic is compiled into the architecture of the network, and so it is also exploited at evaluation time to compute the score of the test query. The different focus on the neural or logic part is further investigated in Section 8.
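For concreteness, the following Python sketch (our own illustration, not the authors' implementation) enumerates the satisfying assignments of the exactly-one constraint β above and evaluates the semantic loss as the negative logarithm of their total probability mass under p.

```python
# Hypothetical sketch of the semantic loss for the exactly-one constraint
#   beta = (X1 & ~X2 & ~X3) | (~X1 & X2 & ~X3) | (~X1 & ~X2 & X3)
# given network outputs p = [p1, p2, p3]; not the authors' implementation.
from itertools import product
from math import log

def satisfies_exactly_one(x):
    return sum(x) == 1

def semantic_loss(p):
    mass = 0.0
    for x in product([0, 1], repeat=len(p)):   # enumerate all assignments
        if satisfies_exactly_one(x):
            prob = 1.0
            for pi, xi in zip(p, x):           # probability of this assignment under p
                prob *= pi if xi else (1.0 - pi)
            mass += prob                       # sum over satisfying assignments
    return -log(mass)

print(semantic_loss([0.7, 0.2, 0.1]))   # low loss: outputs nearly satisfy mutual exclusivity
print(semantic_loss([0.9, 0.9, 0.9]))   # high loss: outputs badly violate the constraint
```

For larger constraints, this brute-force enumeration is replaced by the compiled arithmetic circuits mentioned above.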
Dimension 3: Directed and Undirected models

There are two classes of graphical models: in Bayesian networks, the underlying graph structure is a directed acyclic graph, while in Markov networks the graph is undirected. This distinction is carried over to NeSy, where logical rules are used either to define the forward structure of the neural network or to define a regularization term for the training.

5 Boolean, Probabilistic and Fuzzy logic

One of the most important and complex questions in the neural symbolic community is how to integrate the discrete nature of Boolean logic with the continuous nature of neural representations (e.g. embeddings).

Boolean logic assigns values to atoms in the set {True, False} (or {F, T} or {0, 1}), which are interpreted as truth values. Connectives (e.g. ∧, ∨) are mapped to binary functions of truth values, which are usually described in terms of truth tables.

Example 10 (Boolean Logic). Let us consider the following propositions: alarm, burglary and earthquake. Defining the semantics of this language is about assigning truth values to the propositions and truth tables to the connectives. For example:

A B | A ∨ B        A B | B ← A
F F |   F          F F |   T
F T |   T          F T |   T
T F |   T          T F |   F
T T |   T          T T |   T

I = {alarm = T, burglary = T, earthquake = F}

Once we have defined the semantics of the language, we can evaluate logic sentences, e.g.:

alarm ← (burglary ∨ earthquake) = T

This evaluation can be performed automatically by parsing the expression into the corresponding expression tree:
←
├─ alarm
└─ ∨
   ├─ burglary
   └─ earthquake

The truth value of the sentence is computed by evaluating the tree bottom-up.

Probabilistic logic uses the distribution semantics [93] as the key concept to integrate Boolean logic and probability.² The basic idea is that we interpret each propositional binary variable as a binary random variable. Then, a specific assignment ω of values to the random variables, also called a possible world, is just a specific interpretation of the Boolean logic theory. Any joint distribution p(ω) over the random variables is also a distribution over logic interpretations. The probability of an atom or formula α is defined as the probability that any of the possible worlds that are models of α will occur. Since possible worlds are mutually exclusive, this is just the sum of their probabilities:

p(α) = Σ_{ω |= α} p(ω)    (1)

This is known as the Weighted Model Counting (WMC) problem.

² In this paper, we use the distribution semantics or possible world semantics as representative of the probabilistic approach to logic. While this is the most common solution in StarAI, many other solutions exist [74, 46], whose description is out of the scope of the current survey. A detailed overview of the different flavours of formal reasoning about uncertainty can be found in [47].

Example 11 (Distribution semantics). We can illustrate the distribution semantics by describing a distribution over possible worlds in tabular form, listing all the worlds and the corresponding probabilities. Let B = burglary, E = earthquake, J = hears_alarm_john and M = hears_alarm_mary. We omit deterministic atoms for clarity. Table 2 reports all the possible worlds over these four variables and the corresponding probabilities. Suppose we want to compute the probability of the formula burglary ∧ earthquake. This is done by marginalizing over all those worlds (indicated by a ∗ in Table 2) where both burglary and earthquake are true.

Fuzzy logic, and in particular t-norm fuzzy logic, assigns a truth value to atoms in the continuous real interval [0, 1]. Logical operators are then turned into real-valued functions, mathematically grounded in t-norm theory.
B E J M    p(ω)
F F F F    0.2394
F F F T    0.1026
F F T F    0.3591
F F T T    0.1539
F T F F    0.0126
F T F T    0.0054
F T T F    0.0189
F T T T    0.0081
T F F F    0.0266
T F F T    0.0114
T F T F    0.0399
T F T T    0.0171
T T F F    0.0014 *
T T F T    0.0006 *
T T T F    0.0021 *
T T T T    0.0009 *

Table 2: A distribution over possible worlds for the four propositional variables burglary (B), earthquake (E), hears_alarm_john (J) and hears_alarm_mary (M). The ∗ indicates those worlds where burglary ∧ earthquake is true.

A t-norm t(x, y) is a real function t : [0, 1] × [0, 1] → [0, 1] that models the logical AND and from which the other operators can be derived. Table 3 shows the best-known t-norms and the functions corresponding to their connectives. A fuzzy logic formula is mapped to a real valued function of its input atoms. Fuzzy logic generalizes Boolean logic to continuous values: all the different t-norms agree with Boolean logic at the endpoints of the interval [0, 1], which correspond to completely true and completely false values. The concept of a model in fuzzy logic can easily be recovered from an extension of the model-theoretic semantics of Boolean logic (see Section 2): any fuzzy interpretation is a model of a formula if the formula evaluates to 1.

Fuzzy logic deals with vagueness, which is different from and orthogonal to uncertainty (as in probabilistic logic). This difference is clear when one compares, for example, the fuzzy assignment earthquake = 0.01, which means "very mild earthquake", with p(earthquake = True) = 0.01, which means a low probability of an earthquake.

Example 12 (Fuzzy logic). Let us consider the same propositional language of Example 10. Defining a fuzzy semantics for this language is about assigning truth degrees to each of the propositions and selecting a t-norm implementation of the connectives. Let us consider the Łukasiewicz t-norm and the following interpretation of the language:
I = {alarm = 0.7, burglary = 0.6, earthquake = 0.3}

Once we have defined the semantics of the language, we can evaluate logic sentences, e.g.:

alarm ← (burglary ∨ earthquake) = min(1, 1 − min(1, burglary + earthquake) + alarm) = 0.8

This evaluation can be performed automatically by parsing the logical sentence into the corresponding expression tree and then compiling the expression tree using the corresponding t-norm operations:

←  (t← = 0.8)
├─ alarm  (0.7)
└─ ∨  (t∨ = 0.9)
   ├─ burglary  (0.6)
   └─ earthquake  (0.3)

The resulting circuit represents a differentiable function, and the truth degree of the sentence is computed by evaluating the circuit bottom-up.

                 Product            Łukasiewicz          Gödel
x ∧ y            x · y              max(0, x + y − 1)    min(x, y)
x ∨ y            x + y − x · y      min(1, x + y)        max(x, y)
¬x               1 − x              1 − x                1 − x
x ⇒ y (x > y)    y/x                min(1, 1 − x + y)    y

Table 3: Logical connectives on the inputs x, y when using the fundamental t-norms.

5.1 StarAI along Dimension 4

StarAI is deeply linked to probabilistic logic. The StarAI community has provided several formalisms to define such probability distributions over possible worlds using labeled logic theories. Probabilistic Logic Programs (cf. Example 6) and Markov logic networks (cf. Example 7) are two prototypical frameworks. For example, the distribution in Table 2 is the one modeled by the ProbLog program
in Example 6.

Probabilistic inference (i.e. weighted model counting) is generally intractable. That is why, in StarAI, techniques such as knowledge compilation (KC) [19] are used. Knowledge compilation transforms logical formulae into a new representation in an expensive offline step. For this new representation, a certain set of queries becomes efficient (i.e. poly-time in the size of the new representation). From a probabilistic viewpoint, this translation solves the disjoint-sum problem, i.e. it encodes in the resulting formula the probabilistic dependencies in the theory. After the translation, the probabilities of any conjunction and of any disjunction can simply be computed by multiplying, resp. summing up, the probabilities of their operands. Therefore, the formula can be compiled into an arithmetic circuit ac(α). The weighted model count of the query formula is computed by simply evaluating the corresponding arithmetic circuit bottom-up, i.e. p(α) = ac(α).

Figure 4: d-DNNF (left) and arithmetic circuit (right) corresponding to the ProbLog program in Example 6.

Example 13 (Knowledge Compilation). Let us consider the ProbLog program in Example 6 and the corresponding tabular form in Table 2. Let us consider the atomic formula α = calls(mary). We can use Equation 1 to compute the probability that the formula α holds. To do this, we can iterate over the table and sum all the rows where calls(mary) = T, which are those where either burglary = T or earthquake = T and where hears_alarm(mary) = T. We get that p(α) = 0.0435. This method requires iterating over 2^N rows (where N is the number of probabilistic facts).

Knowledge compilation compiles the program and the query into a
sentation that is logically equivalent. In Figure 4, the target representation is a decomposable, deterministic negative normal form (d-DNNF) [17], for which weighted model counting is poly-time in the size of the formula. Decomposability means that, for every conjunction, the two conjuncts do not share any variables. Deterministic means that, for every disjunction, the two disjuncts are independent. The formula in dNNF can then be straightforwardly turned into an arithmetic circuit by substituting AND nodes with multiplication and OR nodes by summation (i.e. the proba- bility semiring). In Figure 4, we show the dDNNF and the arithmetic circuit of the distribution defined by the ProbLog program in Example 6. The bottom-up evaluation of this arithmetic circuit computes the correct marginal probability p(α) much more efficiently than the naive iterative sum that we have shown before. Even though probabilistic Boolean logic is the most common choice in StarAI, there are approaches using probabilistic fuzzy logic. The most prominent approach is Probabilistic Soft Logic (PSL) [3], as we show in Example 14. Similarly to Markov logic networks, Probabilistic Soft Logic (PSL) defines log linear models where features are represented by ground clauses. However, PSL uses a fuzzy semantics of the logical theory. Therefore, atoms are mapped to real valued random variables and ground clauses are now real valued factors. Example 14 (Probabilistic Soft Logic). Let us consider the logical rule α = calls(X) ← alarm, hears_alarm(X) with weight β . As we have seen in Example 7, Markov Logic translates the formula into a discrete factor by using the indicator functions 1(ω, αθ): φM LN (ω, α) = β 1(ω, α{X/mary}) + β 1(ω, α{X/john}) Instead of discrete indicator functions, PSL translates the formula into a continuous t-norm based function: t(ω, αθ) = min(1 − max(0, alarm + hears_alarm(X) − 1) + calls(X)) and the corresponding potential is then translated into the continuous and differentiable function: φP SL (ω, α) = βt(ω, α{X/mary}) + βt(ω, α{X/john}) Another important task in StarAI is MAP inference. In MAP inference, given the distribution p(ω), one is interested in finding an interpretation ω ? where p is maximum, i.e: ω ? = arg max p(ω) (2) ω 22
When ω is a Boolean interpretation, i.e. ω ∈ {0, 1}^n, as in ProbLog or MLNs, this problem is known to be strictly related to the Weighted Model Counting problem, with which it shares the same complexity class. However, in PSL, ω is a fuzzy interpretation, i.e. ω ∈ [0, 1]^n, and p(ω) ∝ exp(Σ_i βi φ(ω, αi)) is a continuous and differentiable function. The basic idea exploited by PSL is to compile the function Φ(ω) = Σ_i βi φ(ω, αi) into a parametric circuit (cf. Example 12), where the set of parameters is represented by the fuzzy interpretation ω. The MAP inference problem can thus be approximated more efficiently than its Boolean counterpart (i.e. Markov Logic) by gradient-based techniques.

5.2 NeSy along Dimension 4

We have seen that in StarAI, one can turn inference tasks into the evaluation (as in KC) or gradient-based optimization (as in PSL) of a differentiable parametric circuit. The parameters are scalar values (e.g. probabilities or truth degrees) attached to basic elements of the theory (facts or clauses). A natural way of carrying over the StarAI approach to the neural symbolic domain is the reparameterization method: the scalar values assigned to facts or formulas are substituted with the output of a neural network. One can interpret this substitution in terms of a different parameterization of the original model.

Many probabilistic methods parameterize the underlying distribution in terms of neural components. In particular, as we show in Example 15, DeepProbLog exploits neural predicates to compute the probabilities of probabilistic facts as the output of neural computations over vectorial representations of the constants, which is similar to what SL does in the propositional setting (see Example 9). NeurASP also inherits the concept of neural predicate from DeepProbLog.

Example 15 (Probabilistic semantics reparameterization in DeepProbLog). DeepProbLog [64] is a neural extension of the probabilistic logic programming language ProbLog. DeepProbLog allows images or other sub-symbolic representations as terms of the program.

Let us consider a possible neural extension of the program in Example 6. We could extend the predicate calls(X) with two extra inputs, i.e. calls(B, E, X). B is supposed to contain an image of a security camera, while E is supposed to contain the time-series of a seismic sensor. We would like to answer queries like calls( , , mary), i.e. what is the probability that mary calls, given that the security camera has captured the image and the sensor the data . DeepProbLog allows answering this query with the following program:

nn(nn_burglary, [B]) :- burglary(B).
nn(nn_earthquake, [E]) :- earthquake(E).
0.3::hears_alarm(mary).
0.6::hears_alarm(john).
alarm(B,_) :- burglary(B).
alarm(_,E) :- earthquake(E).
calls(B,E,X) :- alarm(B,E), hears_alarm(X).

Here, the program has been extended in two ways. First, new arguments (i.e. B and E) have been introduced in order to deal with the sub-symbolic inputs. Second, the probabilistic facts burglary and earthquake have been turned into neural predicates. Neural predicates are special probabilistic facts that are annotated by neural networks instead of by scalar probabilities.

Inference in DeepProbLog mimics exactly that of ProbLog. Given the query and the program, knowledge compilation is used to build the arithmetic circuit in Figure 5. Since the program is structurally identical to the purely symbolic one in Example 13, the arithmetic circuit is exactly the same, the only difference being that some leaves of the tree (i.e. probabilities of the facts) can now also be neural networks. Given a set of queries that are true, i.e.:

D = {calls( , , mary), calls( , , john), calls( , , mary), ...},

we can train the parameters θ of the DeepProbLog program (both neural networks and scalar probabilities) by maximizing the likelihood of the training queries using gradient descent:

max_θ Σ_{q∈D} p(q)

Similarly to DeepProbLog, NMLN and RNM use neural networks to parameterize the factors (or the weights) of a Markov Logic Network. [91] computes marginal probabilities as logistic functions over similarity measures between embeddings of entities and relations. An alternative way to exploit a probabilistic semantics is to use knowledge graphs (see also Section 10) to define probabilistic priors on neural network predictions, as done in [101].

SBR and LTN reparameterize fuzzy atoms using neural networks that take as input the feature representations of the constants and return the corresponding truth value, as shown in Example 16. Logical rules are then relaxed into soft constraints using fuzzy logic. Many other systems in other communities exploit fuzzy logic to inject knowledge into neural models [43, 61]. These methods often differ only in small implementation details and can be regarded as variants of a single conceptual framework.
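The following Python sketch shows this fuzzy reparameterization step in a stripped-down form: the truth degree of an atom is produced by a small scoring function over the constant's features, and a rule is relaxed with Łukasiewicz connectives into a differentiable penalty. The function and variable names (score, rule_penalty, x_mary, t_alarm) are ours, and the linear-plus-sigmoid scorer merely stands in for a real neural network; Example 16 describes how SBR instantiates this scheme.

```python
# Hypothetical sketch of fuzzy reparameterization: atom truth degrees come from a
# per-predicate scoring function, and a rule is relaxed with Lukasiewicz connectives.
import math
import random

random.seed(0)

# One tiny linear-plus-sigmoid scorer per predicate (a stand-in for a neural network).
predicates = ["calls", "hears_alarm"]
weights = {p: [random.uniform(-1, 1) for _ in range(3)] for p in predicates}

def score(predicate, features):
    """Truth degree in [0, 1] of predicate(constant), given the constant's features."""
    z = sum(w * x for w, x in zip(weights[predicate], features))
    return 1.0 / (1.0 + math.exp(-z))

def luk_and(x, y):
    return max(0.0, x + y - 1.0)

def luk_implies(x, y):
    return min(1.0, 1.0 - x + y)

def rule_penalty(x_mary, t_alarm):
    """Penalty for calls(mary) <- alarm AND hears_alarm(mary); 0 when fully satisfied."""
    body = luk_and(t_alarm, score("hears_alarm", x_mary))
    head = score("calls", x_mary)
    return 1.0 - luk_implies(body, head)

# The penalty can be added to a supervised loss and minimized; a real system would
# use an autodiff framework to backpropagate through the scoring networks.
print(rule_penalty(x_mary=[0.2, 0.7, 0.1], t_alarm=0.8))
```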
Figure 5: A neural reparametrization of the arithmetic circuit in Example 13 as done by DeepProbLog (cf. Example 15). Dashed arrows indicate a negated value, i.e. 1 − x.

Example 16 (Semantic-Based Regularization). Semantic-Based Regularization (SBR) [25] is an example of an undirected model where fuzzy logic is exploited as a regularization term in training a neural model.

Let us consider a possible grounding for the rule in Example 14:

calls(mary) ← alarm, hears_alarm(mary)

For each grounded rule r, SBR builds a regularization loss term L(r) in the following way. First, it maps each constant c (e.g. mary) to a set of (perceptual) features x_c (e.g. a tensor of pixel intensities x_mary). Each relation r (e.g. calls, hears_alarm) is then mapped to a neural network f_r(x), where x is the tensor of features of the input constants and the output is a truth degree in [0, 1]. For example, the atom calls(mary) is mapped to the function call f_calls(x_mary). Propositional variables (e.g. alarm) are mapped to free parameters in [0, 1], e.g. t_alarm (exactly like in PSL, Example 14).

Then, a fuzzy logic t-norm is selected and logic connectives are mapped to the corresponding real valued functions. For example, when the Łukasiewicz t-norm is selected, the implication is mapped to the binary real function f←(x, y) = min(1, 1 − y + x), while the conjunction is f∧(x, y) = max(0, x + y − 1).

For the rule above, the Semantic-Based Regularization loss term is (for
the Łukasiewicz t-norm):

L_Ł(r) = min(1, 1 − f_calls(x_mary) + max(0, t_alarm + f_hears_alarm(x_mary) − 1))

The aim of Semantic-Based Regularization is to use this regularization term together with a classical supervised learning loss function in order to learn the functions associated with the relations, e.g. f_calls and f_hears_alarm.

It is worth comparing this method with the Semantic Loss one (Example 9). Both methods turn a logic formula (either propositional or first-order) into a real valued function that is used as a regularization term. However, because of the different semantics, the two methods show different properties. On the one hand, SL preserves the original logical semantics by using probabilistic logic. However, due to the probabilistic assumption, the input formula cannot be compiled directly into a differentiable loss but first needs to be translated into an equivalent deterministic and decomposable formula. While this step is necessary for the probabilistic modeling to be sound, the size of the resulting formula is usually exponential in the size of the grounded theory. On the other hand, in SBR, the formula can be compiled directly into a differentiable loss whose size is linear in the size of the grounded theory. However, in order to do so, the semantics of the logic is altered by turning it into fuzzy logic.

Fuzzy logic can also be used to relax rules. For example, in LRNN, ∂ILP, DiffLog and [110], the scores of the proofs are computed by using fuzzy logic connectives. The great algebraic variety of t-norm theory has allowed identifying parameterized (i.e. weighted) classes of t-norms [100, 89] that are very close to standard neural computation patterns (e.g. ReLU or sigmoidal layers). This creates an interesting, still not fully understood, connection between soft logical inference and inference in neural networks. A large class of methods [71, 24, 13, 112] relaxes logical statements numerically, giving them no other specific semantics. Here, atoms are assigned scores in R computed by a neural scoring function over embeddings. Numerical approximations are then applied either to combine these scores according to logical formulas or to aggregate proof scores. The resulting neural architecture is usually differentiable and, thus, trained end-to-end.

As for PSL, some NeSy methods have used mixed probabilistic and fuzzy semantics. In particular, [68] extends PSL by adding neurally parameterized factors to the Markov field, while [51] uses fuzzy logic to train posterior regularizers for standard deep networks using knowledge distillation techniques [49].

Fuzzy logic in NeSy is used mostly for computational reasons and not out of an actual need to deal with vagueness. Indeed, all the fuzzy systems described in this survey start from a Boolean theory, relax it to a fuzzy theory and, finally, return to Boolean logic to provide answers or take decisions. We investigate this issue further in Section 9; here, however, we outline two common reasons to exploit fuzzy logic. The first one is to relax logical reasoning and, in particular,
weighted SATisfiability. This is made explicit in systems like LTNs. However, as we will show later, this causes the systems to output fuzzy solutions, which can be incoherent with the Boolean solution of the problem. The second reason is to approximate probabilistic inference, either by providing bounds [89] or by providing an initialization for sampling techniques [3]. For example, PSL solves a fuzzy weighted SAT problem, similar to LTN, to find a fuzzy relaxation of the MAP state (see Example 14), which is then used as a starting point for Markov Chain Monte Carlo (MCMC) inference.

Dimension 4: Boolean, Probabilistic and Fuzzy logic

This dimension concerns the values assigned to atoms and formulas of a logical theory. Boolean logic assigns values in the discrete set {True, False}, e.g. earthquake = True. Probabilistic logic allows computing the probability that an atom or a formula is True, e.g. p(earthquake = True) = 0.05. Fuzzy logic assigns soft truth degrees in the continuous set [0, 1], e.g. earthquake = 0.6. While probabilistic logic brings well-known computational challenges to probabilistic inference, fuzzy logic introduces semantic issues when used as a relaxation of Boolean logic. This trade-off has not yet been clearly understood.

6 Learning: Structure versus Parameters

The learning approaches in StarAI and NeSy can be broadly divided into two categories: structure learning [57] and parameter learning [44, 63]. In structure learning, we are interested in discovering a logical theory, i.e. a set of logical clauses and their corresponding probabilities, that reliably explains the given examples, starting from an empty model. What explaining the examples exactly means changes depending on the learning setting. In discriminative learning, we are interested in learning a theory that explains, or predicts, a specific target relation given background knowledge. In generative learning, there is no specific target relation; instead, we are interested in a theory that explains the interactions between all relations in a dataset. In contrast to structure learning, parameter learning starts with a deterministic logical theory and only learns the corresponding probabilities.

The two modes of learning belong to vastly different complexity classes: structure learning is an inherently NP-complete problem of searching for the right combinatorial structure, whereas parameter learning can be achieved with any curve fitting technique, such as gradient descent or least-squares. While parameter learning is, in principle, an easier problem to solve, it comes with a strong dependency on the provided user input: if the provided clauses are of low quality, the resulting model will also be of low quality. Structure learning, on the other hand, depends less on the provided input, but is faced with an inherently more difficult problem.
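As a toy illustration of parameter learning (a sketch under our own simplifications, not taken from any particular system), the following Python fragment fits the probability of a single probabilistic fact by gradient descent so that the model's probability of a query matches an observed frequency; the target frequency and learning-rate values are invented for the example.

```python
# Hypothetical sketch of parameter learning: fit the probability of the fact
# p::burglary so that the model's probability of the query alarm
# (alarm :- burglary; alarm :- earthquake, with p(earthquake) = 0.05 fixed)
# matches an observed target frequency, via gradient descent on squared error.
p_earthquake = 0.05      # kept fixed
target_p_alarm = 0.2     # "observed" frequency of the query (made up for illustration)

def p_alarm(p_burglary):
    # alarm is true if burglary or earthquake is true (independent facts):
    return 1.0 - (1.0 - p_burglary) * (1.0 - p_earthquake)

p_burglary = 0.5         # initial parameter
lr = 0.5
for step in range(200):
    err = p_alarm(p_burglary) - target_p_alarm
    grad = 2.0 * err * (1.0 - p_earthquake)      # d p_alarm / d p_burglary = 1 - p_earthquake
    p_burglary -= lr * grad
    p_burglary = min(max(p_burglary, 0.0), 1.0)  # keep it a valid probability

print(round(p_burglary, 4), round(p_alarm(p_burglary), 4))
```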