Knowledge Graph Representation: From Recent Models towards a Theoretical Understanding

Ivana Balažević & Carl Allen
January 27, 2021
School of Informatics, University of Edinburgh
What are Knowledge Graphs?

[Figure: a small family graph over entities A, B, C, D, with labelled edges such as "father of", "married to", "mother of", "uncle of" and "sibling".]

Entities E = {A, B, C, D}
Relations R = {married to, father of, uncle of, ...}
Knowledge Graph G = {(A, father of, B), (A, married to, C), ...}
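The toy graph above can be held as a plain set of triples; a minimal sketch in Python (only the first two triples come from the slide — the "uncle of" and "sibling" edges are guesses from the figure, and the helper `holds` is purely illustrative):

```python
# A knowledge graph as a set of (subject, relation, object) triples.
G = {
    ("A", "father of", "B"),
    ("A", "married to", "C"),
    ("A", "uncle of", "D"),   # hypothetical edge read off the figure
    ("B", "sibling", "D"),    # hypothetical edge read off the figure
}

# Entities and relations can be read back off the triple set.
E = {s for s, _, _ in G} | {o for _, _, o in G}
R = {r for _, r, _ in G}

def holds(s, r, o):
    """Check whether a fact is stored in the graph."""
    return (s, r, o) in G
```

Link prediction then amounts to scoring triples *not* in this set, which is what the score functions later in the deck are for.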
Representing Entities and Relations

Subject and object entities e_s, e_o are represented by vectors e_s, e_o ∈ R^d (embeddings).

Relations r are represented by transformations f_r, g_r : R^d → R^{d'} that transform the entity embeddings.

A proximity measure, e.g. Euclidean distance or dot product, compares the transformed subject and object entities.

[Figure: embeddings e_s and e_o mapped by f_r and g_r to e_s^(r) and e_o^(r), which are then compared; illustrated with the triple (Edinburgh, capital of, Scotland).]
Score Function

A score function φ : E × R × E → R brings together the entity representations, relation representation and proximity measure to assign a score φ(e_s, r, e_o) to each triple, used to predict whether the triple is true or false.

Representation parameters are optimised to improve prediction accuracy.

Score functions can be broadly categorised by:
▸ relation representation type (additive, multiplicative or both); and
▸ proximity measure (e.g. dot product, Euclidean distance).

Rel. Repr. Type | Example φ(e_s, r, e_o)                              | Model
Multiplicative  | e_s^T W_r e_o = <e_s^(r), e_o>                      | DistMult (Yang et al., 2015); TuckER (Balažević et al., 2019b)
Additive        | -||e_s + r - e_o||^2                                | TransE (Bordes et al., 2013)
Both            | -||e_s^T W_r^s + r - e_o^T W_r^o||^2 + b_s + b_o    | MuRE (Balažević et al., 2019a)
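The three score-function families in the table can be sketched in a few lines of NumPy. This is a toy illustration, not the papers' implementations: dimensions and random parameters are made up, and the "both" row uses a MuRE-style diagonal relation matrix for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
e_s, e_o = rng.normal(size=d), rng.normal(size=d)

# Multiplicative (DistMult-style): bilinear product with a diagonal W_r.
w_r = rng.normal(size=d)
score_mult = e_s @ np.diag(w_r) @ e_o          # = <e_s * w_r, e_o>

# Additive (TransE-style): negative squared distance after translating by r.
r = rng.normal(size=d)
score_add = -np.sum((e_s + r - e_o) ** 2)

# Both (MuRE-style): relation-specific scaling plus translation and biases.
R_diag, b_s, b_o = rng.normal(size=d), 0.1, -0.2
score_both = -np.sum((R_diag * e_s + r - e_o) ** 2) + b_s + b_o
```

Training would adjust the embeddings and relation parameters so that true triples score higher than false ones.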
TuckER: Tensor Factorization for Knowledge Graph Completion

[Figure 1: Visualization of the TuckER architecture — the core tensor W (d_e × d_r × d_e) contracted with the relation embedding w_r and the entity embeddings e_s, e_o.]

φ_TuckER(e_s, r, e_o) = ((W ×_1 w_r) ×_2 e_s) ×_3 e_o = e_s^T W_r e_o

Multi-task learning: rather than learning distinct relation matrices W_r, the core tensor W contains a shared pool of "prototype" relation matrices that are linearly combined using the parameters of the relation embedding w_r. (Balažević et al., 2019b)
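The TuckER score above — the core tensor contracted with w_r, e_s and e_o in turn — can be checked with a toy einsum. The dimensions and random tensors here are illustrative, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
de, dr = 3, 2                       # entity / relation embedding sizes
W = rng.normal(size=(dr, de, de))   # core tensor, relation mode first
w_r = rng.normal(size=dr)
e_s, e_o = rng.normal(size=de), rng.normal(size=de)

# ((W x_1 w_r) x_2 e_s) x_3 e_o: contract the core tensor mode by mode.
W_r = np.einsum("r,rso->so", w_r, W)   # relation-specific matrix W_r
score = e_s @ W_r @ e_o                # = e_s^T W_r e_o
```

Note how W_r is a w_r-weighted combination of the dr "prototype" de × de slices of W, which is exactly the multi-task sharing the slide describes.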
MuRE: Multi-relational Euclidean Graph Embeddings

[Figure 2: MuRE spheres of influence.]

φ_MuRE(e_s, r, e_o) = -d(R e_s, e_o + r)^2 + b_s + b_o

(Balažević et al., 2019a)
Recap

▸ KGs store facts: binary relations between entities (e_s, r, e_o).
▸ Enable computational reasoning over KGs, e.g. question answering and inferring new facts (link prediction).
▸ Requires representation, typically:
  • each entity by a vector embedding e ∈ R^d,
  • each relation by a transformation from subject entity embedding to object entity embedding.
▸ Many, many models with gradually increasing success, but no principled rationale for why they work, or how to improve them (e.g. more accurate prediction, incorporating logic, etc.).
Simplify: consider Word Embeddings

[Figure: target words w_1, ..., w_n (embedding matrix W) paired with context words c_1, ..., c_n (embedding matrix C).]

▸ Word embeddings, e.g.
  • Word2Vec (W2V; Mikolov et al., 2013)
  • GloVe (Pennington et al., 2014)
▸ Observation: semantic relations between words ⇒ geometric relationships between embeddings
  • similar words ⇒ close embeddings
  • analogies (often) ⇒ w_queen ≈ w_king - w_man + w_woman
▸ Aim: relate the understanding of this to knowledge graph relations.
Understanding word embeddings: the W2V Loss Function

  -ℓ_W2V = Σ_{i,j} [ #(w_i, c_j) log σ(w_i^T c_j) + (k #(w_i) #(c_j) / D) log σ(-w_i^T c_j) ]

  ∇_{w_i} ℓ_W2V ∝ Σ_j ( p(w_i, c_j) + k p(w_i) p(c_j) ) ( σ(S_{i,j}) - σ(w_i^T c_j) ) c_j = C diag(d^(i)) e^(i)

where d_j^(i) = p(w_i, c_j) + k p(w_i) p(c_j) and e_j^(i) = σ(S_{i,j}) - σ(w_i^T c_j).

▸ ℓ_W2V is minimised when:
  • low-rank case: w_i^T c_j = log [ p(c_j | w_i) / p(c_j) ] - log k = S_{i,j} (Levy and Goldberg, 2014), where the first term is PMI(w_i, c_j);
  • general case: the error vectors diag(d^(i)) e^(i) are orthogonal to the rows of C.

⇒ Embedding w_i is a (non-linear) projection of row i of the PMI matrix*, a PMI vector p_i.
(* dropping the -log k term as an artefact of the W2V algorithm.)
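The PMI matrix that the embeddings implicitly factorise can be computed directly from co-occurrence counts; a minimal sketch (the tiny 3 × 3 count matrix is made up for illustration):

```python
import numpy as np

# Toy co-occurrence counts #(w_i, c_j) for 3 target and 3 context words.
counts = np.array([[8., 2., 1.],
                   [2., 6., 2.],
                   [1., 2., 9.]])

total = counts.sum()
p_wc = counts / total                  # joint p(w_i, c_j)
p_w = p_wc.sum(axis=1, keepdims=True)  # marginal p(w_i)
p_c = p_wc.sum(axis=0, keepdims=True)  # marginal p(c_j)

# PMI(w_i, c_j) = log [ p(w_i, c_j) / (p(w_i) p(c_j)) ]
#              = log [ p(c_j | w_i) / p(c_j) ]
PMI = np.log(p_wc / (p_w * p_c))

# Row i is the PMI vector p_i that embedding w_i (non-linearly) projects.
p_0 = PMI[0]
```

In the low-rank case above, W2V's w_i^T c_j approximates these entries (shifted by -log k, the artefact the slide drops).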
PMI Vectors

  p_i = log [ p(E | w_i) / p(E) ], i.e. with components log [ p(c_j | w_i) / p(c_j) ] for c_j ∈ E

(E = dictionary of all words)

Figure 3: The PMI surface S with example PMI vectors of words (red dots).
PMI Vector Interactions = Semantics (Similarity)

Similarity: similar words, e.g. synonyms, induce similar distributions p(E | w) over context words.

Identified by subtraction of PMI vectors:

  p_i - p_j = log [ p(E | w_i) / p(E | w_j) ] = ρ^{i,j}

[Figure: near-identical distributions p(E | dog) and p(E | hound) over words w_1, ..., w_n, hence p_dog ≈ p_hound.]
PMI Vector Interactions = Semantics (Paraphrase)

Paraphrases: word sets with similar aggregate semantic meaning, e.g. {man, royal} ≈ king.

Identified by addition of PMI vectors:

  p_i + p_j = log [ p(E | w_i) / p(E) ] + log [ p(E | w_j) / p(E) ]
            = p_k + log [ p(E | w_i, w_j) / p(E | w_k) ] - log [ p(w_i | E) p(w_j | E) / p(w_i, w_j | E) ] + log [ p(w_i, w_j) / (p(w_i) p(w_j)) ]

where the second term is the paraphrase error ρ^{{i,j},k} and the last two terms are the independence errors σ^{i,j} and τ^{i,j}.

[Figure: p(E | {man, royal}) ≈ p(E | king) over words w_1, ..., w_n, hence p_man + p_royal ≈ p_king.]
PMI Vector Interactions = Semantics (Analogy)

Analogies: word pairs that share a similar semantic difference, e.g. {man, king} and {woman, queen}.

Identified by a linear combination of PMI vectors:

  p_king - p_man ≈ p_queen - p_woman

[Figure: the parallelogram formed by p_man, p_king, p_woman, p_queen.]

(Allen and Hospedales, 2019; Allen et al., 2019)
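The vector-offset test for analogies can be sketched with hand-built feature vectors. The three "features" (royal, male, female) are a made-up illustration, not learned embeddings, but they show the parallelogram structure exactly:

```python
import numpy as np

# Toy embeddings over made-up features [royal, male, female].
vocab = {
    "man":   np.array([0., 1., 0.]),
    "woman": np.array([0., 0., 1.]),
    "king":  np.array([1., 1., 0.]),
    "queen": np.array([1., 0., 1.]),
}

def analogy(a, b, c):
    """Answer 'a : b :: c : ?' by nearest neighbour to w_b - w_a + w_c."""
    target = vocab[b] - vocab[a] + vocab[c]
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - target))

answer = analogy("man", "king", "woman")
```

Here king - man isolates the "royal" feature, and adding it to woman lands on queen — the vector offset acts as the relation.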
From Analogies to Relations

[Figure: the analogy offset p_king - p_man ≈ p_queen - p_woman, re-read as a relation mapping man → king and woman → queen.]

▸ Analogies contain common binary word relations, similar to KGs.
▸ For certain analogies ("specialisations"), the associated "vector offset" gives a transformation that represents the relation.
▸ Not all relations fit this semantic pattern, but we have insight to consider geometric aspects (relation conditions) of other relation types.
Categorising Relations: semantics → relation requirements

[Figure: relationships between PMI vectors for different relation types — Similarity, Relatedness, Specialisation, Context-shift, Generalised context-shift. blue/green = strong word association (PMI > 0); red = relatedness; black = context sets.]

Categorisation of WN18RR relations:

Type | Relation                    | Examples (subject entity, object entity)
R    | verb group                  | (trim down VB 1, cut VB 35), (hatch VB 1, incubate VB 2)
R    | derivationally related form | (lodge VB 4, accommodation NN 4), (question NN 1, inquire VB 1)
R    | also see                    | (clean JJ 1, tidy JJ 1), (ram VB 2, screw VB 3)
S    | hypernym                    | (land reform NN 1, reform NN 1), (prickle-weed NN 1, herbaceous plant NN 1)
S    | instance hypernym           | (yellowstone river NN 1, river NN 1), (leipzig NN 1, urban center NN 1)
C    | member of domain usage      | (colloquialism NN 1, figure VB 5), (plural form NN 1, authority NN 2)
C    | member of domain region     | (rome NN 1, gladiator NN 1), (usa NN 1, multiple voting NN 1)
C    | member meronym              | (south NN 2, sunshine state NN 1), (genus carya NN 1, pecan tree NN 1)
C    | has part                    | (aircraft NN 1, cabin NN 3), (morocco NN 1, atlas mountains NN 1)
C    | synset domain topic of      | (quark NN 1, physics NN 1), (harmonize VB 3, music NN 4)
Categorical completeness: are all relations covered?

▸ View PMI vectors as sets of word features and relation types as set operations:
  • similarity ⇒ set equality
  • relatedness ⇒ subset equality (relation-specific)
  • context-shift ⇒ set difference (relation-specific)
▸ For any relation, each feature is either:
  • necessarily unchanged (relatedness),
  • necessarily/potentially changed (context shift), or
  • irrelevant.
▸ Conjecture: the relation types identified partition the set of semantic relations.
Relations as mappings between embeddings

R: S-relatedness requires both entity embeddings e_s, e_o to share a common subspace component V_S.
▸ project onto V_S (multiply by a matrix P_r ∈ R^{d×d}) and compare.
▸ Dot product: (P_r e_s)^T (P_r e_o) = e_s^T P_r^T P_r e_o = e_s^T M_r e_o
▸ Euclidean distance: ||P_r e_s - P_r e_o||^2 = ||P_r e_s||^2 - 2 e_s^T M_r e_o + ||P_r e_o||^2

S/C: requires S-relatedness and relation-specific component(s) (v_r^s, v_r^o).
▸ project onto a subspace (by P_r ∈ R^{d×d}) corresponding to S, v_r^s and v_r^o (i.e. test S-relatedness while preserving relation-specific components);
▸ add the relation-specific r = v_r^o - v_r^s ∈ R^d to the transformed embeddings.
▸ Dot product: (P_r e_s + r)^T P_r e_o
▸ Euclidean distance: ||P_r e_s + r - P_r e_o||^2 (cf. MuRE: ||R e_s + r - e_o||^2)
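The two proximity measures above, with a projection P_r and offset r, can be written out directly. The parameters here are random toys (a 0/1 diagonal standing in for a subspace projection); in practice P_r and r would be learned per relation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
e_s, e_o = rng.normal(size=d), rng.normal(size=d)

# Relation parameters: a subspace projection P_r and offset r = v_r^o - v_r^s.
P_r = np.diag(rng.integers(0, 2, size=d).astype(float))  # toy projection
r = rng.normal(size=d)

# S/C relation scores under the two proximity measures.
score_dot = (P_r @ e_s + r) @ (P_r @ e_o)
score_euc = -np.sum((P_r @ e_s + r - P_r @ e_o) ** 2)

# R (pure relatedness) is the special case r = 0, giving the bilinear form M_r.
M_r = P_r.T @ P_r
score_rel = e_s @ M_r @ e_o
```

Setting P_r to a full (diagonal) matrix and using the Euclidean score recovers the MuRE form from earlier in the deck.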
Summary

▸ Theoretical: a derivation of geometric components of relation representations from word co-occurrence statistics.
▸ Interpretability: associates geometric model components with semantic aspects of relations.
▸ Empirically supported: justifies the relative link-prediction performance of a range of models on real datasets:

  additive & multiplicative (MuRE*; Balažević et al., 2019a) > multiplicative (TuckER, Balažević et al., 2019b; DistMult, Yang et al., 2015) or additive (TransE; Bordes et al., 2013)

*Note: MuRE was inspired by the vector offset of analogies.

Work to appear in ICLR 2021 (Allen et al., 2021).
Thanks! Any questions?
References

Carl Allen and Timothy Hospedales. Analogies Explained: Towards Understanding Word Embeddings. In ICML, 2019.
Carl Allen, Ivana Balažević, and Timothy Hospedales. What the Vec? Towards Probabilistically Grounded Embeddings. In NeurIPS, 2019.
Carl Allen, Ivana Balažević, and Timothy Hospedales. Interpreting Knowledge Graph Relation Representation from Word Embeddings. In ICLR, 2021.
Ivana Balažević, Carl Allen, and Timothy M. Hospedales. Multi-relational Poincaré Graph Embeddings. In NeurIPS, 2019a.
Ivana Balažević, Carl Allen, and Timothy M. Hospedales. TuckER: Tensor Factorization for Knowledge Graph Completion. In EMNLP, 2019b.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. In NeurIPS, 2013.
Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS, 2014.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop, 2013.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In ICLR, 2015.