FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings
Ahmad Mel, Bo Kang, Jefrey Lijffijt and Tijl De Bie *

AIDA, IDLab-ELIS, Ghent University, 9052 Ghent, Belgium; ahmad.mel@ugent.be (A.M.); bo.kang@ugent.be (B.K.); jefrey.lijffijt@ugent.be (J.L.)
* Correspondence: tijl.debie@ugent.be
† This paper is an extended version of our paper published in IEEE DSAA 2020, the 7th IEEE International Conference on Data Science and Advanced Analytics.

Featured Application: FONDUE can be used to preprocess graph-structured data. In particular, it facilitates detecting nodes in the graph that represent the same real-life entity, and detecting and optimally splitting nodes that represent multiple distinct real-life entities. FONDUE does this in an entirely unsupervised fashion, relying exclusively on the topology of the network.

Abstract: Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to the same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.

Keywords: node disambiguation; node deduplication; node linking; entity linking; network embeddings; representation learning

Citation: Mel, A.; Kang, B.; Lijffijt, J.; De Bie, T. FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings. Appl. Sci. 2021, 11, 9884. https://doi.org/10.3390/app11219884

Academic Editors: Paola Velardi and Stefano Faralli
Received: 2 August 2021; Accepted: 18 October 2021; Published: 22 October 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Increasingly, collected data naturally comes in the form of a network of interrelated entities. Examples include social networks describing social relations between people (e.g., Facebook), citation networks describing the citation relations between papers (e.g., PubMed [1]), biological networks, such as those describing interactions between proteins (e.g., DIP [2]), and knowledge graphs describing relations between concepts or objects (e.g., DBPedia [3]). Thus, new machine learning, data mining, and information retrieval methods are increasingly targeting data in their native network representation.

An important problem across all fields of data science, broadly speaking, is data quality. For problems on networks, especially those that are successful in exploiting fine- as well as coarse-grained structure of networks, ensuring good data quality is perhaps even more important than in standard tabular data. For example, an incorrect edge can have
a dramatic effect on the implicit representation of other nodes, by dramatically changing distances on the network. Similarly, mistakenly representing distinct real-life entities by the same node in the network may dramatically alter its structural properties, by increasing the degree of the node and by merging the possibly quite distinct neighborhoods of these entities into one. Conversely, representing the same real-life entity by multiple nodes can also negatively affect the topology of the graph, possibly even splitting apart communities.

Although identifying missing edges and, conversely, identifying incorrect edges can be tackled adequately using link prediction methods, prior work has neglected the other task: identifying and appropriately splitting nodes that are ambiguous, i.e., nodes that correspond to more than one real-life entity. We will refer to this task as node disambiguation (NDA). A converse and equally important problem is that of identifying multiple nodes corresponding to the same real-life entity, a problem we will refer to as node deduplication (NDD).

This paper proposes a unified and principled framework for both the NDA and NDD problems, called the framework for node disambiguation and deduplication using network embeddings (FONDUE). FONDUE is inspired by the empirical observation that real (natural) networks tend to be easier to embed than artificially generated (unnatural) networks, and rests on the associated hypothesis that the existence of ambiguous or duplicate nodes makes a network less natural. Although most of the existing methods tackling NDA and NDD make use of additional information (e.g., node attributes, descriptions, or labels) for identifying and processing these problematic nodes, FONDUE adopts a more widely applicable approach that relies solely on topological information. Although exploiting additional information may of course increase the accuracy on those tasks, we argue that a method that does not require such information offers unique advantages, e.g., when data availability is scarce, or when building an extensive dataset on top of the graph data is not feasible for practical reasons. Additionally, this approach fits the privacy-by-design framework, as it eliminates the need to incorporate more sensitive data. Finally, we argue that, even in cases where such additional information is available, it is both of scientific and of practical interest to explore how much can be achieved without using it, relying solely on the network topology instead. Indeed, although this is beyond the scope of the current paper, it is clear that methods that solely rely on the network topology could be combined with methods that exploit additional node-level information, plausibly leading to improved performance over either type of approach individually.

1.1. The Node Disambiguation Problem

We address the problem of NDA in the most basic setting: given an unweighted, unlabeled, and undirected network, the task considered is to identify nodes that correspond to multiple distinct real-life entities. We formulate this as an inverse problem, where we use the given ambiguous network (which contains ambiguous nodes) in order to retrieve the unambiguous network (in which all nodes are unambiguous). Clearly, this inverse problem is ill-posed, making it impossible to solve without additional information (which we do not want to assume) or an inductive bias.
The key insight in this paper is that such an inductive bias can be provided by the network embedding (NE) literature. This literature has produced embedding-based models that are capable of accurately modeling the connectivity of real-life networks down to the node-level, while being unable to accurately model random networks [4,5]. Inspired by this research, we propose to use as an inductive bias the fact that the unambiguous network must be easy to model using a NE. Thus, we introduce FONDUE-NDA, a method that identifies nodes as ambiguous if, after splitting, they maximally improve the quality of the resulting NE. Example 1. Figure 1a illustrates the idea of FONDUE for NDA applied on a single node. In this example, node i with embedding xi corresponds to two real-life entities that belong to two separate
communities, visualized by either full or dashed lines, to highlight the distinction. Because node i is connected to two different communities, most NE methods would locate its embedding x_i between the embeddings of the nodes from both communities. Figure 1b shows a split of node i into nodes i′ and i″, each with connections to only one of the two communities. The resulting network is easy to embed by most NE methods, with embeddings x_{i′} and x_{i″} close to their own respective communities. In contrast, Figure 1c shows a split where the two resulting nodes are harder to embed. Most NE methods would embed them between both communities, but substantial tension would remain, resulting in a worse value of the NE objective function.

Figure 1. (a) A node that corresponds to two real-life entities that belong to two communities. Links that connect the node with the different communities are plotted in either full lines or dashed lines. (b) An ideal split that aligns well with the communities. (c) A less optimal split.

1.2. The Node Deduplication Problem

The same inductive bias can also be used for the NDD problem. The NDD problem is: given an unweighted, unlabeled, and undirected network, identify distinct nodes that correspond to the same real-life entity. To this end, FONDUE-NDD determines how well merging two given nodes into one would improve the embedding quality of NE models. The inductive bias considers a merge as better than another one if it results in a better value of the NE objective function. The diagram in Figure 2 shows the suggested pipeline for tackling both problems.

Figure 2. FONDUE pipeline for both NDA and NDD. Data corruption (during data collection or data processing of structured data, documents, graph data, etc.) can lead to two types of problems: node ambiguation (e.g., multiple authors sharing the same name represented by one node in the network), in the left part of the diagram, and node duplication (e.g., one author with name variations represented by more than one node in the network). We then define two tasks to resolve both problems separately using FONDUE: node disambiguation (FONDUE-NDA, via splitting) and node deduplication (FONDUE-NDD, via contraction).
1.3. Contributions

In this paper, we make a number of related contributions:
• We propose FONDUE, a framework exploiting the empirical observation that naturally occurring networks can be embedded well using state-of-the-art NE methods, to tackle two distinct tasks: node deduplication (FONDUE-NDD) and node disambiguation (FONDUE-NDA). The former works by identifying nodes as more likely to be duplicated if contracting them enhances the quality of an optimal NE; the latter by identifying nodes as more likely to be ambiguous if splitting them enhances the quality of an optimal NE;
• In addition to this conceptual contribution, substantial challenges had to be overcome to implement this idea in a scalable manner. Specifically, for the NDA problem, through a first-order analysis we derive a fast approximation of the expected NE quality improvement after splitting a node;
• We implemented this idea for CNE [6], a recent state-of-the-art NE method, although we demonstrate that the approach can be applied to a broad class of other NE methods as well;
• We tackle the NDA problem with extensive experiments over a wide range of networks, demonstrating the superiority of FONDUE over the state of the art for the identification of ambiguous nodes, at a comparable computational cost;
• We also empirically observe that, somewhat surprisingly, despite the increase in accuracy for identifying ambiguous nodes, no such improvement was observed for the ambiguous node splitting accuracy. Thus, for NDA, we recommend using FONDUE for the identification of ambiguous nodes, while using an existing state-of-the-art approach for optimally splitting them;
• Experiments on four datasets for NDD demonstrate the viability of FONDUE-NDD for the NDD problem based only on the topological features of a network.

2. Related Work

The problem of NDA differs from named-entity disambiguation (NED; also known as named entity linking), a natural language processing (NLP) task where the purpose is to identify which real-life entity from a list a named entity in a text refers to. For example, in the ArnetMiner dataset [7], 'Bin Zhu' corresponds to more than 10 authors. The Open Researcher and Contributor ID (ORCID) [8] was introduced to solve the author name ambiguity problem, and most NED methods rely on ORCID for labeling datasets. NED in this context aims to match the author names to unique (unambiguous) author identifiers [7,9–11]. In [7], hidden Markov random fields are exploited in a unified probabilistic framework to model node and edge features. On the other hand, Zhang et al. [12] designed a comprehensive framework to tackle name disambiguation, using a complex feature-engineering approach: they construct paper networks and use the information shared between two papers to build a supervised model for assigning the weights of the edges of the paper network. If two nodes (papers) in the network are connected, they are more likely to be authored by the same person. Recent approaches increasingly rely on more complex data. Ma et al. [13] used heterogeneous bibliographic network representation learning, employing relational and paper-related textual features to obtain the embeddings of multiple types of nodes, while using meta-path-based proximity measures to evaluate the neighboring and structural similarities of node embeddings in heterogeneous graphs. The work of Zhang et al.
[9], which focuses on preserving privacy by using solely the link information in a graph, employs network embedding as an intermediate step to perform NED, but relies on other networks (person–document and document–document) in addition to the person–person network to perform the task.

Although NDA could be used to assist in NED tasks, NED typically strongly relies on the text, e.g., by characterizing the context in which the named entity occurs (e.g., paper
topic) [14]. Similarly, Ma et al. [15] propose a name disambiguation model based on representation learning employing attributes and network connections, by first encoding the attributes of each paper using a variational graph auto-encoder, then computing a similarity metric from the relationships between these attributes, and finally using graph embedding to leverage the author relationships, thus heavily relying on NLP. In NDA, in contrast, no natural language is considered, and the goal is to rely on just the network's connectivity in order to identify which nodes may correspond to multiple distinct entities. Moreover, NDA does not assume the availability of a list of known unambiguous entity identifiers, such that an important part of the challenge is to identify which nodes are ambiguous in the first place. This offers a privacy-friendly advantage and extends the applicability to more datasets, where access to additional information is restricted or not possible.

The research by Saha et al. [16] and Hermansson et al. [17] is most closely related to ours. These papers also use only the topological information of the network for NDA. Yet, Ref. [16] also requires timestamps for the edges, while [17] requires a training set of nodes labeled as ambiguous and non-ambiguous. Moreover, even though the method proposed by [16] is reportedly orders of magnitude faster than the one proposed by [17], it remains computationally substantially more demanding than FONDUE (e.g., [16] evaluate their method on networks with just 150 entities). Other recent work using NE for NED [9,18–20] is only indirectly related, as it relies on additional information besides the topology of the network.

The literature on NDD is scarce, as the problem is not well defined. Conceptually, it is similar to the named entity linking (NEL) problem [11,21], which aims to link instances of named entities in a text, such as newspaper articles, to the corresponding entities, often in knowledge bases (KBs). Consequently, NEL heavily relies on textual data to identify erroneous entities rather than on entity connections, which are the core of our method. KB approaches for NEL are dominant in the field [22,23], as they make use of knowledge base datasets, heavily relying on labeled and additional graph data to tackle the named entity linking task. This also poses a challenge when it comes to benchmarking our method for NDD. We identified no studies in the current literature that tackle NDD from a topological perspective, i.e., without relying on additional attributes and features.

3. Methods

Section 3.1 formally defines the NDA and NDD problems. Section 3.2 introduces the FONDUE framework in a maximally generic manner, independent of the specific NE method it is applied to, or the task (NDA or NDD) it is used for. A scalable approximation of FONDUE-NDA is described throughout Section 3.3, and applied to CNE as a specific NE method. Section 3.4 details the FONDUE-NDD method used for NDD.

Throughout this paper, a bold uppercase letter denotes a matrix (e.g., $A$), a bold lowercase letter denotes a column vector (e.g., $x_i$), $(\cdot)^\top$ denotes the matrix transpose (e.g., $A^\top$), and $\|\cdot\|$ denotes the Frobenius norm of a matrix (e.g., $\|A\|$).

3.1. Problem Definition

We denote an undirected, unweighted, unlabeled graph as $G = (V, E)$, with $V = \{1, 2, \ldots, n\}$ the set of $n$ nodes (or vertices), and $E \subseteq \binom{V}{2}$ the set of edges (or links) between these nodes.
We also define the adjacency matrix of a graph $G$, denoted $A \in \{0,1\}^{n \times n}$, with $A_{ij} = 1$ if $\{i, j\} \in E$. We denote by $a_i \in \{0,1\}^n$ the adjacency vector of node $i$, i.e., the $i$-th column of the adjacency matrix $A$, and by $\Gamma(i) = \{j \mid \{i, j\} \in E\}$ the set of neighbors of $i$.

3.1.1. Formalizing the Node Disambiguation Problem

To formally define the NDA problem as an inverse problem, we first need to define the forward problem, which maps an unambiguous graph onto an ambiguous one. This formalizes the 'corruption' process that creates ambiguity in the graph. In practice, this happens most often because identifiers of the entities represented by the nodes are not
unique. For example, in a co-authorship network, the identifiers could be non-unique author names. To this end, we define a node contraction:

Definition 1 (Node Contraction). A node contraction $c$ for a graph $G = (V, E)$ with $V = \{1, 2, \ldots, n\}$ is a surjective function $c : V \to \hat{V}$ for some set $\hat{V} = \{1, 2, \ldots, \hat{n}\}$ with $\hat{n} \le n$. For convenience, we will define $c^{-1} : \hat{V} \to 2^V$ as $c^{-1}(i) = \{k \in V \mid c(k) = i\}$ for any $i \in \hat{V}$. Moreover, we will refer to the cardinality $|c^{-1}(i)|$ as the multiplicity of the node $i \in \hat{V}$.

A node contraction defines an equivalence relation $\sim_c$ over the set of nodes: $i \sim_c j$ if $c(i) = c(j)$, and the set $\hat{V}$ is the quotient set $V/\sim_c$. Continuing our example of a co-authorship network, a node contraction maps an author onto the node representing their name. Two authors $i$ and $j$ would be equivalent if their names $c(i)$ and $c(j)$ are equal, and the multiplicity of a node is the number of distinct authors with the corresponding name.

We can naturally define the concept of an ambiguous graph in terms of the contraction operation, as follows.

Definition 2 (Ambiguous graph). Given a graph $G = (V, E)$ and a node contraction $c$ for that graph, the graph $\hat{G} = (\hat{V}, \hat{E})$ defined by $\hat{E} = \{\{c(k), c(l)\} \mid \{k, l\} \in E\}$ is referred to as an ambiguous graph of $G$. Overloading notation, we may write $\hat{G} \triangleq c(G)$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the unambiguous graph.

Continuing the example of the co-authorship network, the contraction operation can be thought of as the operation that replaces author identities with their names, which may map distinct authors onto the same shared name. Note that the symbols for the ambiguous graph and its sets of nodes and edges are denoted here using hats, to indicate that in the NDA problem we are interested in situations where the ambiguous graph is the empirically observed graph. We can now formally define the NDA problem as inverting this contraction operation:

Definition 3 (The Node Disambiguation Problem). Given an ambiguous graph $\hat{G} = (\hat{V}, \hat{E})$, NDA aims to retrieve the unambiguous graph $G = (V, E)$ and the associated node contraction $c$, i.e., a contraction $c$ for which $c(G) = \hat{G}$.

To be more precise, it suffices to identify $G$ up to an isomorphism, as the actual identifiers of the nodes are irrelevant.

3.1.2. Formalizing the Node Deduplication Problem

The NDD problem can be formalized as the converse of the NDA problem, also relying on the concept of node contractions. First, a duplicate graph can be defined as follows:

Definition 4 (Duplicate graph). Given a graph $G = (V, E)$, a graph $\hat{G} = (\hat{V}, \hat{E})$ where $\{k, l\} \in \hat{E} \Rightarrow \{c(k), c(l)\} \in E$ for an appropriate contraction $c$, and where for each $\{i, j\} \in E$ there exists an edge $\{k, l\} \in \hat{E}$ for which $c(k) = i$ and $c(l) = j$, is referred to as a duplicate graph of $G$. More concisely, using the overloaded notation from Definition 2, a duplicate graph $\hat{G}$ is a graph for which $c(\hat{G}) = G$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the deduplicated graph.

Continuing the example of the co-authorship network, one node in the duplicate graph could correspond to one of two versions of the name of the same author, such that the author is represented by two different nodes in the duplicate graph. A contraction operation that maps duplicate names to their common identity would merge such nodes corresponding to the same author. Hats on top of the symbols of the duplicate graph indicate that in the NDD problem we are interested in the situation where the duplicate graph is the empirically observed one.
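To make the forward ('corruption') operation of Definitions 1 and 2 concrete, the following is a minimal sketch, illustrative only and not part of the FONDUE code base; it assumes simple graphs, so self-loops produced by merging equivalent nodes are dropped, and all names are chosen for illustration.

```python
import networkx as nx

def contract_graph(G: nx.Graph, c: dict) -> nx.Graph:
    """Apply a node contraction c (a surjective node -> label map) to G.

    Returns the ambiguous graph G_hat, whose nodes are the labels c(v) and
    whose edges are {c(k), c(l)} for every edge {k, l} of G (Definition 2).
    """
    G_hat = nx.Graph()
    G_hat.add_nodes_from({c[v] for v in G.nodes()})
    for k, l in G.edges():
        if c[k] != c[l]:   # drop self-loops created by merging equivalent nodes
            G_hat.add_edge(c[k], c[l])
    return G_hat

# Toy example: two distinct authors (1 and 2) share the same name 'a',
# so node 'a' in the contracted graph has multiplicity 2.
G = nx.Graph([(1, 3), (1, 4), (2, 5), (2, 6)])
c = {1: 'a', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e'}
G_hat = contract_graph(G, c)
```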
The NDD problem can, thus, be formally defined as follows:
Definition 5 (The Node Deduplication Problem). Given a duplicate graph $\hat{G} = (\hat{V}, \hat{E})$, NDD aims to retrieve the deduplicated graph $G = (V, E)$ and the node contraction $c$ associated with $\hat{G}$, i.e., for which $G = c(\hat{G})$.

3.1.3. Real Graphs Suffer from Both Issues

Of course, many real graphs require both deduplication and disambiguation. This is particularly true for the running example of the co-authorship network. Yet, while building on the common FONDUE framework, we define and study both problems separately, and propose an algorithm for each in Section 3.3 (for NDA) and Section 3.4 (for NDD). For networks suffering from both problems, both algorithms can be applied concurrently or sequentially without difficulties, thus solving both problems simultaneously.

3.2. FONDUE as a Generic Approach

To address both the NDA and NDD problems, FONDUE uses the inductive bias that the non-corrupted (unambiguous and deduplicated) network must be easy to model using NE. This allows us to approach both problems in the context of NE. Here, we first formalize the inductive bias of FONDUE (Section 3.2). This will later allow us to present both the FONDUE-NDA (Section 3.3) and FONDUE-NDD (Section 3.4) algorithms, each tackling one of the data corruption tasks (NDA and NDD, respectively).

The FONDUE Inductive Bias

Clearly, both the NDA and NDD problems are inverse problems, with NDA an ill-posed one. Thus, further assumptions, an inductive bias, or priors are inevitable in order to solve them. The key hypothesis in FONDUE is that the unambiguous and deduplicated $G$, considering it is a 'natural' graph, can be embedded well using state-of-the-art NE methods. This hypothesis is inspired by the empirical observation that NE methods embed 'natural' graphs well.

NE methods find a mapping $f : V \to \mathbb{R}^d$ from nodes to $d$-dimensional real vectors. An embedding is denoted as $X = (x_1, x_2, \ldots, x_n)^\top \in \mathbb{R}^{n \times d}$, where $x_i \triangleq f(i)$ for $i \in V$ is the embedding of each node. Most well-known NE methods aim to find an optimal embedding $X^*_G$ for a given graph $G$ that minimizes a continuous, differentiable cost function $O(G, X)$. Thus, given an ambiguous graph $\hat{G}$, FONDUE-NDA will search for the graph $G$ such that $c(G) = \hat{G}$ for an appropriate contraction $c$, while optimizing the NE cost function on $G$:

Definition 6 (NE-based NDA problem). Given an ambiguous graph $\hat{G}$, NE-based NDA aims to retrieve the unambiguous graph $G$ and the associated contraction $c$:

$$\operatorname*{argmin}_{G} \; O\big(G, X^*_G\big) \quad \text{s.t. } c(G) = \hat{G} \text{ for some contraction } c. \tag{1}$$

Ideally, this optimization problem can be solved by simultaneously finding optimal splits for all nodes (i.e., an inverse of the contraction $c$) that yield the smallest embedding cost after re-embedding. However, this strategy requires (a) searching for splits in an exponential search space containing the combinations of splits (with arbitrary cardinality) of all nodes, and (b) recomputing the embedding of the resulting network in order to evaluate each combination of splits. Thus, this ideal solution is computationally intractable and more scalable solutions are needed (see Section 3.3).

Similarly, for NDD, given a duplicate graph $\hat{G}$, FONDUE-NDD will search for a graph $G$ such that $c(\hat{G}) = G$ for an appropriate contraction $c$, again while optimizing the NE cost function on $G$:
Definition 7 (NE-based NDD problem). Given a duplicate graph $\hat{G}$, NE-based NDD aims to retrieve the deduplicated graph $G$ and the associated contraction $c$ of $\hat{G}$:

$$\operatorname*{argmin}_{G} \; O\big(G, X^*_G\big) \quad \text{s.t. } c(\hat{G}) = G \text{ for some contraction } c. \tag{2}$$

Generally speaking, to solve this optimization problem, we would want to find the optimal merging of all the nodes that most reduces the cost of the embedding after re-embedding. Yet, a thorough optimization of this problem is beyond the scope of this paper, and as an approximation we rely on a ranking-based approach, where we rank networks with randomly merged nodes according to the value of the objective function after re-embedding. This may be suboptimal, but it highlights the viability of the concept when used for NDD, as shown in the experimental results.

Although the principle underlying both methods is thus very similar, we will see below that the corresponding methods differ considerably. Common to both is the need for a basic understanding of NE methods.

3.3. FONDUE-NDA

From the above section, it is clear that the NDA problem can be decomposed into two subproblems:
1. Estimating the multiplicities of all $i \in \hat{V}$, i.e., the number of unambiguous nodes from $G$ represented by each node from $\hat{G}$. This essentially amounts to estimating the contraction $c$. Note that the number of nodes $n$ in $V$ is then equal to the sum of these multiplicities, and arbitrarily assigning these $n$ nodes to the sets $c^{-1}(i)$ defines $c^{-1}$ and, thus, $c$;
2. Given $c$, estimating the edge set $E$. To ensure that $c(G) = \hat{G}$, for each $\{i, j\} \in \hat{E}$ there must exist at least one edge $\{k, l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$. However, this leaves the problem underdetermined (making it ill-posed), as there may also exist multiple such edges.

As an inductive bias for the second step, we will additionally assume that the graph $G$ is sparse. Thus, FONDUE-NDA estimates $G$ as the graph with the smallest set $E$ for which $c(G) = \hat{G}$. Practically, this means that an edge $\{i, j\} \in \hat{E}$ results in exactly one edge $\{k, l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$, and that equivalent nodes $k \sim_c l$ with $k, l \in V$ are never connected by an edge, i.e., $\{k, l\} \notin E$. This bias is justified by the sparsity of most 'natural' graphs, and our experiments indicate that it is justified.

We approach the NE-based NDA problem (Definition 6) in a greedy and iterative manner. In each iteration, FONDUE-NDA identifies the node that has a split which results in the smallest value of the cost function among all nodes. To further reduce the computational complexity, FONDUE-NDA only splits one node into two nodes at a time (e.g., Figure 1b), i.e., it splits node $i$ into two nodes $i'$ and $i''$ with corresponding adjacency vectors $a_{i'}, a_{i''} \in \{0,1\}^n$, $a_{i'} + a_{i''} = a_i$. We refer to such a split as a binary split. Note that repeated binary splits can of course be used to achieve the same result as a single split into several nodes, so this assumption does not imply a loss of generality or applicability. Once the best binary split of the best node is identified, FONDUE-NDA splits that node and starts the next iteration.

The evaluation of each split requires recomputing the embedding and comparing the resulting optimal NE cost functions with each other. Unfortunately, this naive strategy is computationally intractable: computing a single NE is already computationally demanding for most (if not all) NE methods.
Thus, having to compute a re-embedding for all possible splits, even binary ones (there are $O(n2^d)$ of them, with $n$ the number of nodes and $d$ the maximal degree), is entirely infeasible for practical networks.
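To illustrate this combinatorial blow-up, the following sketch (illustrative only, not part of the FONDUE implementation) enumerates the binary splits of a single node's neighborhood, encoded as the ±1 assignment vectors that reappear as $b_i$ in Section 3.3.1; each unordered split appears twice, once per labeling of $i'$ and $i''$.

```python
from itertools import product

import networkx as nx

def binary_splits(G: nx.Graph, i):
    """Yield every binary split of node i's neighborhood Gamma(i).

    A split is encoded as a dict mapping each neighbor j to +1 (j stays
    attached to i') or -1 (j moves to i'').
    """
    neighbors = list(G.neighbors(i))
    for signs in product((1, -1), repeat=len(neighbors)):
        if 1 in signs and -1 in signs:      # both i' and i'' keep at least one edge
            yield dict(zip(neighbors, signs))

G = nx.karate_club_graph()
print(sum(1 for _ in binary_splits(G, 0)))  # 2^16 - 2 splits for this degree-16 node
```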
3.3.1. A First-Order Approximation for Computational Tractability

Thus, instead of recomputing the embedding, FONDUE-NDA performs a first-order analysis by investigating the effect of an infinitesimal split of a node $i$ around its embedding $x_i$ on the cost $O(\hat{G}_{s_i}, \hat{X}_{s_i})$ obtained after performing the split, with $\hat{G}_{s_i}$ and $\hat{X}_{s_i}$ referring to the ambiguous graph and its embedding, respectively, after splitting node $i$.

Drawing intuition from Figure 1, when two distinct authors share the same name in a given collaboration network, their respective separate communities (ego-networks) are lumped into one big cluster. Yet, from a topological point of view, that ambiguous node (author name) is connected to two communities that are generally different, meaning they share very few, if any, links. This stems from the observation that it is highly unlikely that two authors with the exact same name would belong to the same community, i.e., collaborate together. Furthermore, splitting this ambiguous node into two different ones (distinguishing the two authors) would ideally separate these two communities. We therefore consider that each community, which is supposed to be embedded separately, pulls the ambiguous node towards its own embedding region, and that, once separated, the embeddings of each of the resolved nodes will improve. Our main goal is thus to quantify the improvement in the embedding cost function obtained by separating the two nodes $i'$ and $i''$ by a unit distance in a certain direction. We propose to split the assignment of the edges of $i$ between $i'$ and $i''$, such that all the links from $i$ are distributed to either $i'$ or $i''$ in such a way as to maximize the improvement of the embedding cost function, which can be evaluated by computing the gradient with respect to the separation $\delta_i$.

Specifically, FONDUE-NDA seeks the split of node $i$ that results in embeddings $x_{i'}$ and $x_{i''}$ with infinitesimal difference $\delta_i$ (where $\delta_i = x_{i'} - x_{i''}$, $x_{i'} = x_i + \frac{\delta_i}{2}$, $x_{i''} = x_i - \frac{\delta_i}{2}$, and $\delta_i \to 0$, e.g., Figure 1b), such that $\|\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i})\|$ is large, with $\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i})$ being the gradient of $O(\hat{G}_{s_i}, \hat{X}_{s_i})$ with respect to $\delta_i$. This can be computed analytically. Indeed, applying the chain rule, we find:

$$\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i}) = \frac{1}{2}\nabla_{x_{i'}} O(\hat{G}_{s_i}, \hat{X}_{s_i}) - \frac{1}{2}\nabla_{x_{i''}} O(\hat{G}_{s_i}, \hat{X}_{s_i}). \tag{3}$$

Many recent NE methods, like LINE [24] and CNE [6], aim to embed 'similar' nodes in the graph closer to each other, and 'dissimilar' nodes further away from each other (for a particular similarity notion depending on the NE method). For such methods, Equation (3) can be further simplified. Indeed, as such NE methods focus on modeling a property of pairs of nodes (their similarity), their objective functions can typically be decomposed as a summation of node-pair interaction losses over all node pairs. For example, this can be seen in Section 3.3.3 of the current paper for CNE [6], and in Equations (3) and (6) of [24] for LINE. Each of these node-pair interaction losses quantifies the extent to which the proximity between nodes' embeddings reflects their 'similarity' in the network.
For methods where this decomposition is possible, we can thus write the objective function as follows:

$$O(G, X) = \sum_{\{i,j\} \in V \times V} O_p(A_{ij}, x_i, x_j) = \sum_{\{i,j\} \in E} O_p(A_{ij} = 1, x_i, x_j) + \sum_{\{k,l\} \notin E} O_p(A_{kl} = 0, x_k, x_l),$$

where $O_p(A_{ij}, x_i, x_j)$ denotes the node-pair interaction loss for the nodes $i$ and $j$, $O_p(A_{ij} = 1, x_i, x_j)$ is the part of the objective function that corresponds to nodes $i$ and $j$ with an edge between them ($A_{ij} = 1$), and $O_p(A_{kl} = 0, x_k, x_l)$ is the part of the objective function where nodes $k$ and $l$ are disconnected.
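For concreteness, here is a minimal sketch of evaluating such a pairwise-decomposable objective; the per-pair loss passed in is a placeholder for whichever NE method is used, and the toy loss shown is not the CNE loss.

```python
import numpy as np

def pairwise_objective(A: np.ndarray, X: np.ndarray, pair_loss) -> float:
    """Evaluate O(G, X) as a sum of node-pair interaction losses O_p(A_ij, x_i, x_j)."""
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):          # each unordered pair once
            total += pair_loss(A[i, j], X[i], X[j])
    return total

def toy_pair_loss(a, xi, xj):
    """Toy placeholder: squared-distance penalty for linked pairs, margin for unlinked."""
    d2 = float(np.sum((xi - xj) ** 2))
    return d2 if a == 1 else max(0.0, 1.0 - d2)

# Tiny usage example with a 2-node graph and random 3-dimensional embeddings.
A = np.array([[0, 1], [1, 0]])
X = np.random.default_rng(0).normal(size=(2, 3))
print(pairwise_objective(A, X, toy_pair_loss))
```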
Given that $\Gamma(i) = \Gamma(i') \cup \Gamma(i'')$ and $\Gamma(i') \cap \Gamma(i'') = \emptyset$, we can apply the same decomposition approach to $\nabla_{x_{i'}} O(\hat{G}_{s_i}, \hat{X}_{s_i})$:

$$\begin{aligned}
\nabla_{x_{i'}} O(\hat{G}_{s_i}, \hat{X}_{s_i}) &= \nabla_{x_{i'}} \sum_{j \in \Gamma(i')} O_p(A_{i'j} = 1, x_{i'}, x_j) + \nabla_{x_{i'}} \sum_{l \notin \Gamma(i')} O_p(A_{i'l} = 0, x_{i'}, x_l) \\
&= \nabla_{x_{i'}} \sum_{j \in \Gamma(i')} O_p(A_{i'j} = 1, x_{i'}, x_j) + \nabla_{x_{i'}} \sum_{l \in \Gamma(i'')} O_p(A_{i'l} = 0, x_{i'}, x_l) + \nabla_{x_{i'}} \sum_{l \notin \Gamma(i)} O_p(A_{i'l} = 0, x_{i'}, x_l),
\end{aligned}$$

and similarly to $\nabla_{x_{i''}} O(\hat{G}_{s_i}, \hat{X}_{s_i})$:

$$\nabla_{x_{i''}} O(\hat{G}_{s_i}, \hat{X}_{s_i}) = \nabla_{x_{i''}} \sum_{j \in \Gamma(i'')} O_p(A_{i''j} = 1, x_{i''}, x_j) + \nabla_{x_{i''}} \sum_{l \in \Gamma(i')} O_p(A_{i''l} = 0, x_{i''}, x_l) + \nabla_{x_{i''}} \sum_{l \notin \Gamma(i)} O_p(A_{i''l} = 0, x_{i''}, x_l).$$

Additionally, as both nodes $i'$ and $i''$ share the same set of non-neighbors of node $i$, we can write the following:

$$\nabla_{x_{i'}} \sum_{l \notin \Gamma(i)} O_p(A_{i'l} = 0, x_{i'}, x_l) = \nabla_{x_{i''}} \sum_{l \notin \Gamma(i)} O_p(A_{i''l} = 0, x_{i''}, x_l).$$

Furthermore, incorporating the previous two decompositions, we can rewrite Equation (3) as follows:

$$\begin{aligned}
\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i}) &= \frac{1}{2}\Bigg[\sum_{j \in \Gamma(i')} \nabla_{x_{i'}} O_p(A_{i'j} = 1, x_{i'}, x_j) - \sum_{l \in \Gamma(i')} \nabla_{x_{i''}} O_p(A_{i''l} = 0, x_{i''}, x_l) \\
&\qquad\; + \sum_{l \in \Gamma(i'')} \nabla_{x_{i'}} O_p(A_{i'l} = 0, x_{i'}, x_l) - \sum_{j \in \Gamma(i'')} \nabla_{x_{i''}} O_p(A_{i''j} = 1, x_{i''}, x_j)\Bigg] \\
&= \frac{1}{2}\sum_{j \in \Gamma(i')} \Big[\nabla_{x_{i'}} O_p(A_{i'j} = 1, x_{i'}, x_j) - \nabla_{x_{i''}} O_p(A_{i''j} = 0, x_{i''}, x_j)\Big] \\
&\quad - \frac{1}{2}\sum_{l \in \Gamma(i'')} \Big[\nabla_{x_{i''}} O_p(A_{i''l} = 1, x_{i''}, x_l) - \nabla_{x_{i'}} O_p(A_{i'l} = 0, x_{i'}, x_l)\Big].
\end{aligned}$$

Given that $\Gamma(i')^\complement = \Gamma(i) \setminus \Gamma(i') = \Gamma(i'')$ and $\Gamma(i'')^\complement = \Gamma(i) \setminus \Gamma(i'') = \Gamma(i')$, the above equation can be simplified to:

$$\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i}) = \frac{1}{2}\sum_{j \in \Gamma(i)} (-1)^m \Big[\nabla_{x_i} O_p(A_{ij} = 1, x_i, x_j) - \nabla_{x_i} O_p(A_{ij} = 0, x_i, x_j)\Big], \tag{4}$$

with $m = 0$ if $j \in \Gamma(i')$ and $m = 1$ if $j \in \Gamma(i'')$.

Let $F_i^1 \in \mathbb{R}^{d \times |\Gamma(i)|}$ be the matrix whose columns are the gradient vectors $\nabla_{x_i} O_p(A_{ij} = 1, x_i, x_j)$ (one column for each $j \in \Gamma(i)$), and let $F_i^0 \in \mathbb{R}^{d \times |\Gamma(i)|}$ be the matrix whose columns are the gradient vectors $\nabla_{x_i} O_p(A_{ij} = 0, x_i, x_j)$ (also one column for each $j \in \Gamma(i)$). Moreover, let $b_i \in \{1, -1\}^{|\Gamma(i)|}$ be a vector with a dimension corresponding to each of the neighbors $j \in \Gamma(i)$ of $i$, with value equal to $1$ if that neighbor is a neighbor of $i'$ and equal to $-1$ if it is a neighbor of $i''$ after splitting $i$. Then the gradient in Equation (4) can be written more concisely and transparently as follows:

$$\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i}) = \frac{1}{2}\big(F_i^1 - F_i^0\big)\, b_i.$$
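A minimal sketch of evaluating this first-order split score follows; `grad_pair_linked` and `grad_pair_unlinked` are placeholders (to be supplied by the caller) for the per-pair gradients $\nabla_{x_i} O_p(A_{ij}=1, x_i, x_j)$ and $\nabla_{x_i} O_p(A_{ij}=0, x_i, x_j)$ of whichever NE method is plugged in.

```python
import numpy as np

def split_gradient(X, neighbors, i, b, grad_pair_linked, grad_pair_unlinked):
    """Return 0.5 * (F_i^1 - F_i^0) b for a candidate binary split b of node i.

    X           : (n, d) embedding matrix of the observed (ambiguous) graph.
    neighbors   : list of neighbors Gamma(i) of node i.
    b           : sequence of +1/-1, one entry per neighbor (the split encoding).
    grad_pair_* : callables (x_i, x_j) -> d-dimensional gradient of the node-pair
                  loss O_p with respect to x_i, for linked / unlinked pairs.
    """
    F1 = np.column_stack([grad_pair_linked(X[i], X[j]) for j in neighbors])
    F0 = np.column_stack([grad_pair_unlinked(X[i], X[j]) for j in neighbors])
    return 0.5 * (F1 - F0) @ np.asarray(b, dtype=float)

def split_score(X, neighbors, i, b, grad_pair_linked, grad_pair_unlinked):
    """Squared two-norm of the split gradient: larger means a more promising split."""
    g = split_gradient(X, neighbors, i, b, grad_pair_linked, grad_pair_unlinked)
    return float(g @ g)
```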
The aim of FONDUE-NDA is to identify node splits for which the embedding quality improvement is maximal. As argued above, we propose to approximately quantify this by means of a first-order approximation, namely by maximizing the squared two-norm of this gradient, $\|\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i})\|^2$. Denoting $M_i = (F_i^1 - F_i^0)^\top (F_i^1 - F_i^0)$ (and recognizing that $b_i^\top b_i = |\Gamma(i)|$ is independent of $b_i$), FONDUE-NDA can thus be formalized in the following compact form:

$$\operatorname*{argmax}_{i,\; b_i \in \{1,-1\}^{|\Gamma(i)|}} \; \frac{b_i^\top M_i b_i}{b_i^\top b_i}. \tag{5}$$

Note that $M_i \succeq 0$ for all nodes and all splits, such that this is an instance of a Boolean quadratic maximization problem [25,26]. This problem is NP-hard, so further approximations are required to ensure tractability in practice.

3.3.2. Additional Heuristics for Enhanced Scalability

In order to efficiently search for the best split of a given node, we developed two approximation heuristics. First, we randomly split the neighborhood $\Gamma(i)$ into two and evaluate the objective (Equation (5)); we repeat this randomization procedure a fixed number of times and pick the split that gives the best objective value. Second, we find the eigenvector $v$ that corresponds to the largest absolute eigenvalue of the matrix $M_i$, sort the elements of $v$, and assign the top $k$ corresponding nodes to $\Gamma(i')$ and the rest to $\Gamma(i'')$, evaluating the objective value for $k = 1, \ldots, |\Gamma(i)|$ and picking the best split. Finally, we combine these two heuristics and use the split that gives the best objective value (Equation (5)) as the final split of node $i$.

3.3.3. FONDUE-NDA Using CNE

We now apply FONDUE-NDA to conditional network embedding (CNE). CNE proposes a probability distribution for network embedding and finds a locally optimal embedding by maximum likelihood estimation. CNE has the objective function:

$$O(G, X) = \log(P(A \mid X)) = \sum_{i,j:\, A_{ij} = 1} \log P_{ij}(A_{ij} = 1 \mid X) + \sum_{i,j:\, A_{ij} = 0} \log P_{ij}(A_{ij} = 0 \mid X). \tag{6}$$

Here, the link probabilities $P_{ij}$ conditioned on the embedding are defined as follows:

$$P_{ij}(A_{ij} = 1 \mid X) = \frac{P_{\hat{A},ij}\, \mathcal{N}_{+,\sigma_1}(\|x_i - x_j\|)}{P_{\hat{A},ij}\, \mathcal{N}_{+,\sigma_1}(\|x_i - x_j\|) + (1 - P_{\hat{A},ij})\, \mathcal{N}_{+,\sigma_2}(\|x_i - x_j\|)},$$

where $\mathcal{N}_{+,\sigma}$ denotes a half-normal distribution [27] with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, and where $P_{\hat{A},ij}$ is a prior probability for a link to exist between nodes $i$ and $j$, as inferred from the degrees of the nodes (or based on other information about the structure of the network [28]).

First, we derive the gradient:

$$\nabla_{x_i} O(G, X) = \gamma \sum_{j \neq i} (x_i - x_j)\big(P(A_{ij} = 1 \mid X) - A_{ij}\big),$$

where $\gamma = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}$. This allows us to further compute the gradient

$$\nabla_{\delta_i} O(\hat{G}_{s_i}, \hat{X}_{s_i}) = -\frac{\gamma}{2}\Big[\cdots \;\; (x_i - x_j) \;\; \cdots\Big]\, b_i,$$

where the matrix contains one column $x_i - x_j$ for each neighbor $j \in \Gamma(i)$.
Thus, the Boolean quadratic maximization problem has the form:

$$\operatorname*{argmax}_{i,\; b_i \in \{1,-1\}^{|\Gamma(i)|}} \; \frac{b_i^\top M_i b_i}{b_i^\top b_i}, \quad \text{with } (M_i)_{kl} = (x_i - x_k)^\top (x_i - x_l) \text{ for } k, l \in \Gamma(i). \tag{7}$$

3.4. FONDUE-NDD

Using the inductive bias for the NDD problem, the goal is to minimize the embedding cost after merging the duplicate nodes in the graph (Equation (2)). This is motivated by the fact that natural networks tend to be modeled better by NE methods than corrupted (duplicate) networks, so their embedding cost should be lower. Merging (or contracting) duplicate nodes (nodes that refer to the same entity) in a duplicate graph $\hat{G}$ would therefore result in a contracted graph $\hat{G}_c$ that is less corrupt (resembling a 'natural' graph more closely), and thus has a lower embedding cost. Contrary to NDA, NDD is more straightforward, as it does not deal with the problem of reassigning the edges of a node after splitting, but rather simply with determining the duplicate nodes in a duplicate graph. FONDUE-NDD, applied on $\hat{G}$, aims to find duplicate node pairs in the graph and combine each of them into one node by reassigning the union of their edges, which results in a contracted graph $\hat{G}_c$.

Using NE methods, FONDUE-NDD aims to iteratively identify a node pair $\{i, j\} \in \hat{V}_{cand}$, where $\hat{V}_{cand}$ is the set of all possible candidate node pairs, that, if merged together to form one node $i_m$, would result in the smallest cost function value among all node pairs. Thus, the problem in Definition 7 can be further rewritten as:

$$\operatorname*{argmin}_{\{i,j\} \in \hat{V}_{cand}} \; O\big(\hat{G}_{c_{ij}}, \hat{X}_{c_{ij}}\big), \tag{8}$$

where $\hat{G}_{c_{ij}}$ is the graph contracted from $\hat{G}$ by merging the node pair $\{i, j\}$, and $\hat{X}_{c_{ij}}$ is its respective embedding.

Trying this for all possible node pairs in the graph is intractable. It is not obvious what information could be used to approximate Equation (8), so we approach the problem simply by randomly selecting node pairs, merging them, observing the values of the cost function, and then ranking the results. The lower the cost score, the more likely it is that the merged nodes are duplicates. Lacking a scalable bottom-up procedure to identify the best node pairs, in the experiments our focus will be on evaluating whether the introduced criterion for merging is indeed useful to identify whether node pairs appear to be duplicates.

FONDUE-NDD Using CNE

Similarly to the previous section, we proceed by applying CNE as the network embedding method. The objective function of FONDUE-NDD is thus that of CNE, evaluated on the tentatively deduplicated graph after attempting a merge:

$$O(\hat{G}_{c_{ij}}, \hat{X}_{c_{ij}}) = \log\big(P(\hat{A}_{c_{ij}} \mid \hat{X}_{c_{ij}})\big) = -\sum_{k,l:\, \hat{A}_{c_{ij},kl} = 1} \log\!\left(1 + \frac{\sigma_1 (1 - P_{kl})}{\sigma_2 P_{kl}} \exp\!\Big(\Big(\tfrac{1}{\sigma_1^2} - \tfrac{1}{\sigma_2^2}\Big)\tfrac{d_{kl}^2}{2}\Big)\right) - \sum_{k,l:\, \hat{A}_{c_{ij},kl} = 0} \log\!\left(1 + \frac{\sigma_2 P_{kl}}{\sigma_1 (1 - P_{kl})} \exp\!\Big(\Big(\tfrac{1}{\sigma_2^2} - \tfrac{1}{\sigma_1^2}\Big)\tfrac{d_{kl}^2}{2}\Big)\right), \tag{9}$$

with $d_{kl} = \|x_k - x_l\|$, and with the link probabilities conditioned on the embedding defined as follows:

$$P_{kl}(\hat{A}_{c_{ij},kl} = 1 \mid X) = \frac{P_{\hat{A}_{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\|x_k - x_l\|)}{P_{\hat{A}_{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\|x_k - x_l\|) + (1 - P_{\hat{A}_{c_{ij}},kl})\, \mathcal{N}_{+,\sigma_2}(\|x_k - x_l\|)}.$$
Similarly to Section 3.3.3, $\mathcal{N}_{+,\sigma}$ denotes a half-normal distribution with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, and $P_{\hat{A}_{c_{ij}},kl}$ is a prior probability for a link to exist between nodes $k$ and $l$, as inferred from the network properties.

4. Experiments

In this section, we investigate quantitatively and qualitatively the performance of FONDUE on both semi-synthetic and real-world datasets, compared to state-of-the-art methods tackling the same problems. In Section 4.1, we introduce and discuss the different datasets used in our experiments; in Section 4.2 we discuss the performance of FONDUE-NDA, and in Section 4.3 that of FONDUE-NDD. Finally, in Section 4.4, we summarize and discuss the results. All code used in this section is publicly available from the GitHub repository https://github.com/aida-ugent/fondue, accessed on 20 October 2021.

4.1. Datasets

One main challenge for the evaluation of disambiguation tasks is the scarcity of ambiguous (contracted) graph datasets with reliable ground truth. Furthermore, other studies that focus on ambiguous node identification often do not publish their heavily processed datasets (e.g., the DBLP datasets of [16]), which makes it harder to benchmark different methods. Thus, to simulate data corruption in real-world datasets, we opted to create a contracted graph from a given source graph, and then use the latter as ground truth to assess the accuracy of FONDUE compared to other baselines. To do so, we used a simple approach for node contraction, for both NDA (Section 4.2.1) and NDD (Section 4.3.1). Below, in Table 1, we list the details of the different datasets used in our experiments after post-processing. Additionally, we also use real-world networks containing ambiguous and duplicate nodes, mainly part of the PubMed collaboration network, analyzed in Appendix A. The PubMed data are released in independent issues, so to build a connected network from the PubMed data, we select issues that contain ambiguous and duplicate nodes. We then select the largest connected component of that network. One main limitation of this dataset is that not every author has an associated ORCID iD, which affects the false positive and false negative labels in the network (author names that might be ambiguous would be ignored). This is further highlighted in the subsequent sections.

4.2. Node Disambiguation

In this section, we investigate the following questions:
(Q1) Quantitatively, how does our method perform in identifying ambiguous nodes compared to the state-of-the-art and other heuristics? (Section 4.2.2);
(Q2) Qualitatively, how reliable is the quality of the detected ambiguous nodes compared to other methods when applied to real-world datasets? (Section 4.2.3);
(Q3) Quantitatively, how does our method perform in terms of splitting the ambiguous nodes? (Section 4.2.4);
(Q4) How does the behavior of the method change when the degree of contraction of a network varies? (Section 4.2.5);
(Q5) Does the proposed method scale? (Section 4.2.6);
(Q6) Quantitatively, how does our method perform in terms of node deduplication? (Section 4.3.1).

4.2.1. Data Processing

Before conducting the experiments, the data had to be processed to generate semi-synthetic networks. This was completed by contracting each of the thirteen datasets mentioned in Table 2.
More specifically, for each network $G = (V, E)$, a graph contraction was performed to create a contracted (ambiguous) graph $\hat{G} = (\hat{V}, \hat{E})$ by randomly merging a fraction $r$ of the total number of nodes, creating a ground truth against which to test our proposed method. This is completed by first specifying the fraction of the nodes in the graph to be contracted ($r \in \{0.001, 0.01, 0.1\}$), and then sampling two sets of vertices, $\hat{V}_i \subset \hat{V}$ and $\hat{V}_j \subset \hat{V}$, such that $|\hat{V}_i| = |\hat{V}_j| = \lfloor r \cdot |\hat{V}| \rfloor$ and $\hat{V}_i \cap \hat{V}_j = \emptyset$. Then, every vertex $v_j \in \hat{V}_j$ is merged with the corresponding vertex $v_i \in \hat{V}_i$ by reassigning the links connected
to $v_j$ to $v_i$ and removing $v_j$ from the network. The node pairs $(v_i, v_j)$ later serve as ground truth. We have also tested the case where the candidate contracted vertices have no common neighbors (instead of a uniform selection at random). This mimics some types of social networks in which, when two authors share the same name, their ego-networks often do not intersect. Further analysis of the PubMed dataset (Table 1) revealed that none of the ambiguous nodes shared edges with the neighbors of another ambiguous node. We tested the performance of FONDUE-NDA, as well as that of the competing methods listed in the following section, on the fourteen different datasets listed in Table 1, with their properties shown in Table 2.

Table 1. The different datasets used in our experiments (Section 4.1).

FB-SC — Facebook Social Circles network [29]. Consists of anonymized friends lists from Facebook.
FB-PP — Page–Page graph of verified Facebook pages [29]. Nodes represent official Facebook pages, while the links are mutual likes between pages.
email — Anonymized network generated using email data from a large European research institution, modeling the incoming and outgoing email exchange between its members [29].
STD — A database network of the Computer Science department of the University of Antwerp that represents the connections between students, professors and courses [30].
PPI — A subnetwork of the BioGRID Interaction Database [31] that uses the PPI network for Homo sapiens.
lesmis — A network depicting the coappearance of characters in the novel Les Misérables [32].
netscience — A coauthorship network of scientists working on network theory and experiments [29].
polbooks — Network of books about US politics, with edges between books representing frequent copurchasing of books by the same buyers (http://www-personal.umich.edu/~mejn/netdata/, accessed on 20 October 2021).
CondMat — Collaboration network of Arxiv Condensed Matter Physics [33].
GrQc — Collaboration network of Arxiv General Relativity [33].
HepTh — Collaboration network of Arxiv Theoretical High Energy Physics [33].
CM03 — Collaboration network of Arxiv Condensed Matter till 2003 [33].
CM05 — Collaboration network of Arxiv Condensed Matter till 2005 [33].
PubMed — Collaboration network extracted from the PubMed database (analyzed in Appendix A), containing 2122 nodes with a ground truth of 31 ambiguous nodes (6 of which map to more than 2 entities) and 1 duplicate node [1].

Table 2. Various properties of each semi-synthetic network used in our experiments.

              fb-sc      fb-pp      email      lesmis     polbooks   STD
# Nodes       4039       22,470     986        77         105        395
# Edges       88,234     170,823    16,064     254        441        3423
Avg degree    43.7       15.2       32.6       6.6        8.4        17.3
Density       1.1 × 10⁻²  6.8 × 10⁻⁴  3.3 × 10⁻²  8.7 × 10⁻²  8.1 × 10⁻²  4.4 × 10⁻²

              ppi        netscience GrQc       CondMat    HepTh      CM05       CM03
# Nodes       3852       379        4158       21,363     8638       36,458     27,519
# Edges       37,841     914        13,422     91,286     24,806     171,735    116,181
Avg degree    19.6       4.8        6.5        8.5        5.7        9.4        8.4
Density       5.1 × 10⁻³  1.3 × 10⁻²  1.6 × 10⁻³  4.0 × 10⁻⁴  6.6 × 10⁻⁴  2.6 × 10⁻⁴  3.1 × 10⁻⁴
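A minimal sketch of the contraction procedure described above follows (illustrative only, not the authors' code; it uses networkx and assumes the source graph has already been loaded).

```python
import numpy as np
import networkx as nx

def make_semi_synthetic(G: nx.Graph, r: float, seed: int = 0):
    """Contract a fraction r of the nodes of G, as in Section 4.2.1.

    Samples two disjoint vertex sets of size floor(r * |V|), merges each v_j
    into its paired v_i, and returns the contracted graph together with the
    ground-truth pairs (v_i, v_j).
    """
    rng = np.random.default_rng(seed)
    G_hat = G.copy()
    k = int(r * G.number_of_nodes())
    sampled = rng.choice(list(G.nodes()), size=2 * k, replace=False)
    V_i, V_j = sampled[:k], sampled[k:]
    for v_i, v_j in zip(V_i, V_j):
        # reassign v_j's links to v_i and remove v_j (no self-loops kept)
        nx.contracted_nodes(G_hat, v_i, v_j, self_loops=False, copy=False)
    return G_hat, list(zip(V_i, V_j))

# Example: contract 1% of the nodes of a toy random graph.
G = nx.erdos_renyi_graph(1000, 0.01, seed=1)
G_hat, ground_truth = make_semi_synthetic(G, r=0.01)
```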
4.2.2. Quantitative Evaluation of Node Identification

In this section, we focus on answering Q1: given a contracted graph, FONDUE-NDA aims to identify the list of contracted (ambiguous) nodes present in it. We first discuss the baselines used in the experiments.

Baselines. As mentioned earlier in Section 1, most entity disambiguation methods in the literature focus on the task of re-assigning the edges of an already predefined set of ambiguous nodes, and the process of identifying these nodes in a given non-attributed network is usually overlooked. Thus, very few approaches tackle the latter case. In this section, we compare FONDUE-NDA with three competing approaches that focus on the identification task: one existing method and two heuristics.

Normalized Cut (NC). The work of [16] comes closest to ours, as their method also aims to identify ambiguous nodes in a given graph, by utilizing Markov clustering to cluster the ego network of a vertex $u$ with the vertex itself removed. NC favors groupings with few cross-edges between the different clusters of $u$'s neighbors. The result is a score reflecting the quality of the clustering, the normalized cut (NC):

$$NC = \sum_{i=1}^{k} \frac{W(C_i, \bar{C}_i)}{W(C_i, C_i) + W(C_i, \bar{C}_i)},$$

with $W(C_i, C_i)$ the sum of all the edges within cluster $C_i$, $W(C_i, \bar{C}_i)$ the sum of all the edges between cluster $C_i$ and the rest of the network $\bar{C}_i$, and $k$ the number of clusters in the graph. Although [17] also worked on identifying ambiguous nodes based on topological features, their method (which is not publicly available) performed worse in all cases when compared to [16], so we only chose the latter as a competing baseline.

Connected-Component Score (CC). We also include another baseline, the connected-component score (CC), relying on the same approach used in [16], with a slight modification: instead of computing the normalized cut score based on the clusters of the ego graph of a node, we count the number of connected components of the node's ego graph, with the node itself removed.

Degree. Finally, we use the node degree as a baseline. As contracted nodes usually tend to have a higher degree, by inheriting the edges of the combined nodes, the degree is a simple predictor of node ambiguity.

Evaluation Metric. FONDUE-NDA ranks nodes according to their calculated ambiguity score (how likely that node is to be ambiguous). The same goes for NC and CC. At first glance, the evaluation can be approached from a binary classification perspective, by considering the top-X ranked nodes as ambiguous (where X is the actual number of true positives), and thus we can use the usual metrics for binary classification, such as F1-score, precision, recall, and AUC. However, this requires knowing beforehand the number of true positives, i.e., the number of actual ambiguous nodes (or setting a clear cutoff value), which is only possible in labeled datasets and controlled experiments. In real-world settings, if FONDUE-NDA is to be used to detect ambiguous nodes in unlabeled networks, this practical application is rather more restricted, as it is more useful to have relevant (ambiguous) nodes ranked more highly than non-relevant nodes.
Thus, it is necessary to extend the traditional binary classification evaluation methods, which are based on binary relevance judgments, to more flexible graded relevance judgments, such as, for example, the cumulative gain, which is a form of graded precision, as it is identical to the precision when the rating scale is binary. However, as our datasets are highly imbalanced by nature, mainly because ambiguous nodes are by definition a small part of the network, a better take on the cumulative gain metric is needed. Hence, we employ the normalized discounted cumulative gain to evaluate our method, alongside the traditional binary classification metrics listed above. Below, we detail each metric.
Precision. The number of correctly identified positive results divided by the number of all positive results:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

Recall. The number of correctly identified positive results divided by the number of all positive samples:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

F1-score. The harmonic mean of precision and recall; the F1-score reaches its best value at 1 and its worst at 0:

$$F1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}.$$

Note that, because in this binary classification setting the number of false positives is equal to the number of false negatives, the values of recall, precision, and F1-score are the same.

Area Under the ROC Curve (AUC). A ROC curve is a 2D depiction of a classifier's performance, which can be reduced to a single scalar value by calculating the area under the curve (AUC). Essentially, the AUC computes the probability that our measure ranks a randomly chosen ambiguous node (positive example) higher than a randomly chosen non-ambiguous node (negative example). Ideally, this probability value is 1, which means our method has successfully identified the ambiguous nodes 100% of the time; the baseline value is 0.5, where ambiguous and non-ambiguous nodes are indistinguishable. This accuracy measure has been used in other works in this field, including [16], which makes it easier to compare with their work.

Discounted Cumulative Gain (DCG). The main limitation of the previous metrics, as discussed earlier, is their inability to account for graded scores rather than only binary classification. To account for this, we utilize different cumulative-gain-based methods. Given a search result list, the cumulative gain (CG) is the sum of the graded relevance values of all results:

$$CG = \sum_{i=1}^{n} \text{relevance}_i.$$

DCG [34], on the other hand, takes position significance into account and adds a penalty if a highly relevant document appears lower in a search result list, as the graded relevance value is logarithmically reduced, proportionally to the position of the result. Practically, it is the sum of the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount; the higher the DCG, the better the ranking:

$$DCG = \sum_{i=1}^{n} \frac{\text{relevance}_i}{\log_2(i + 1)}.$$

Normalized Discounted Cumulative Gain (NDCG). This metric is commonly used in the information retrieval field to measure the effectiveness of search algorithms, where highly relevant documents are more useful if they appear earlier in the search results, and more useful than marginally relevant documents, which are in turn better than non-relevant documents. It improves upon DCG by accounting for the variation of the relevance and providing proper upper and lower bounds, allowing it to be averaged across all the relevance scores. Thus, it is computed by summing the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount, and then dividing by the best possible score, the ideal DCG (IDCG, obtained for a perfect ranking), to obtain a score between 0 and 1:

$$NDCG = \frac{DCG}{IDCG}.$$
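A minimal sketch of the NDCG computation as used here, with binary relevance (1 for a truly contracted node, 0 otherwise); illustrative only.

```python
import numpy as np

def dcg(relevance):
    """DCG of a relevance list given in predicted-ranking order."""
    rel = np.asarray(relevance, dtype=float)
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

def ndcg(scores, labels):
    """NDCG of binary ground-truth labels ranked by descending predicted score."""
    order = np.argsort(-np.asarray(scores))
    ranked = np.asarray(labels, dtype=float)[order]
    ideal = np.sort(np.asarray(labels, dtype=float))[::-1]
    return dcg(ranked) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Example: three of the six nodes are truly ambiguous.
print(ndcg(scores=[0.9, 0.1, 0.8, 0.3, 0.7, 0.2], labels=[1, 0, 1, 0, 0, 1]))
```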
Evaluation pipeline. We first perform network contraction on the original graph, fixing the ratio of ambiguous nodes to $r$. We then embed the network using CNE and compute the disambiguation measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. The scores yielded by the measures are then compared to the ground truth (i.e., binary labels indicating whether a node is a contracted node). This is completed for three different values of $r \in \{0.001, 0.01, 0.1\}$. We repeat the process 10 times, using a different random seed to generate the contracted network each time, and average the scores. For the embedding configuration, we set the parameters of CNE to $\sigma_1 = 1$, $\sigma_2 = 2$, with the dimensionality limited to $d = 8$.

Results. The results are illustrated in Figure 3 and shown in detail in Table 3, focusing mainly on NDCG as it is the better measure for assessing the ranking performance of each method. FONDUE-NDA outperforms the state-of-the-art method, as well as the non-trivial baselines, in terms of NDCG on most datasets. It is also more robust to variations in the size of the network and in the fraction of ambiguous nodes in the graph. NC seems to struggle to identify ambiguous nodes in smaller networks (Table 2). Additionally, as we tested multiple network settings, with randomly uniform contraction (randomly selecting a node pair and merging them together) or conditional contraction (selecting node pairs that do not share common neighbors, to realistically mimic collaboration networks), we did not observe any significant changes in the results.

Table 3. Performance evaluation (NDCG) on multiple datasets for our method compared with other baselines, for two different contraction methods. Note that for some datasets with a small number of nodes, we did not perform any contraction for r = 0.001, as the number of contracted nodes in this case is very small; the values for those methods are replaced by "−".
Ambiguity rate:          10%                                1%                                 0.1%
Method:        FONDUE-NDA  NC     CC     Degree   FONDUE-NDA  NC     CC     Degree   FONDUE-NDA  NC     CC     Degree

(a) Randomly uniform contraction
fb-sc          0.954   0.962  0.768  0.776     0.767   0.875  0.569  0.423     0.679   0.535  0.272  0.199
fb-pp          0.899   0.825  0.821  0.804     0.649   0.532  0.528  0.511     0.374   0.268  0.268  0.253
email          0.783   0.661  0.619  0.704     0.529   0.305  0.264  0.310     −       −      −      −
student        0.778   0.664  0.568  0.652     0.396   0.328  0.235  0.257     −       −      −      −
lesmis         0.906   0.570  0.499  0.622     −       −      −      −         −       −      −      −
polbooks       0.972   0.604  0.534  0.698     1.000   0.310  0.267  0.318     −       −      −      −
ppi            0.759   0.670  0.724  0.741     0.420   0.353  0.381  0.387     0.194   0.138  0.147  0.151
netscience     0.886   0.784  0.731  0.721     0.508   0.378  0.323  0.288     −       −      −      −
GrQc           0.857   0.805  0.796  0.768     0.603   0.447  0.437  0.415     0.249   0.195  0.184  0.168
CondMat        0.864   0.855  0.843  0.816     0.601   0.553  0.543  0.520     0.367   0.278  0.269  0.255
HepTh          0.860   0.798  0.823  0.796     0.582   0.466  0.494  0.470     0.325   0.201  0.224  0.208
cm05           0.884   0.873  0.859  0.827     0.627   0.590  0.582  0.545     0.471   0.312  0.307  0.288
cm03           0.888   0.869  0.852  0.823     0.635   0.577  0.562  0.534     0.335   0.297  0.281  0.272

(b) Contraction of node pairs with no common neighbors
fb-sc          0.953   0.989  0.768  0.764     0.730   0.933  0.591  0.418     0.399   0.665  0.321  0.172
fb-pp          0.895   0.826  0.820  0.801     0.650   0.532  0.529  0.510     0.389   0.266  0.267  0.253
email          0.676   0.696  0.625  0.604     0.303   0.319  0.288  0.256     −       −      −      −
student        0.659   0.726  0.531  0.587     0.368   0.447  0.201  0.229     −       −      −      −
lesmis         0.755   0.591  0.498  0.486     −       −      −      −         −       −      −      −
polbooks       0.981   0.620  0.544  0.696     1.000   0.268  0.420  0.363     −       −      −      −
ppi            0.725   0.673  0.721  0.700     0.398   0.352  0.381  0.373     0.166   0.139  0.147  0.144
netscience     0.877   0.797  0.714  0.705     0.622   0.372  0.304  0.290     −       −      −      −
GrQc           0.861   0.806  0.794  0.766     0.580   0.445  0.435  0.416     0.280   0.197  0.183  0.173
CondMat        0.863   0.855  0.843  0.815     0.585   0.554  0.542  0.516     0.317   0.274  0.273  0.257
HepTh          0.856   0.798  0.824  0.796     0.581   0.467  0.494  0.480     0.285   0.204  0.212  0.213
cm05           0.883   0.874  0.858  0.825     0.633   0.591  0.582  0.543     0.414   0.310  0.312  0.289
cm03           0.884   0.869  0.853  0.822     0.651   0.577  0.561  0.533     0.439   0.295  0.279  0.271
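Although the results above concern NDA, the NDD evaluation that follows relies on the random-merge ranking that Section 3.4 described only in prose. A minimal sketch of that ranking step is given below; `embed` and `objective` are placeholders for the NE method of choice (e.g., CNE) and must be supplied by the caller.

```python
import random
import networkx as nx

def rank_candidate_merges(G_hat, candidate_pairs, embed, objective,
                          n_samples=100, seed=0):
    """Rank candidate node pairs of G_hat by the NE objective after merging them.

    For each sampled pair {i, j}, the pair is contracted into one node, the
    resulting graph is re-embedded, and the objective value is recorded
    (Equation (8)); a lower cost suggests the pair is more likely a duplicate.
    """
    rng = random.Random(seed)
    pairs = list(candidate_pairs)
    sampled = rng.sample(pairs, min(n_samples, len(pairs)))
    results = []
    for i, j in sampled:
        G_merged = nx.contracted_nodes(G_hat, i, j, self_loops=False)
        X = embed(G_merged)                       # placeholder NE call
        results.append(((i, j), objective(G_merged, X)))
    return sorted(results, key=lambda t: t[1])    # ascending cost
```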