Article
Article
FONDUE: A Framework for Node Disambiguation and
Deduplication Using Network Embeddings †
Ahmad Mel, Bo Kang, Jefrey Lijffijt and Tijl De Bie *

AIDA, IDLab-ELIS, Ghent University, 9052 Ghent, Belgium; ahmad.mel@ugent.be (A.M.); bo.kang@ugent.be (B.K.); jefrey.lijffijt@ugent.be (J.L.)
* Correspondence: tijl.debie@ugent.be
† This paper is an extended version of our paper published in IEEE DSAA 2020, The 7th IEEE International Conference on Data Science and Advanced Analytics.

Featured Application: FONDUE can be used to preprocess graph-structured data. In particular, it facilitates detecting nodes in the graph that represent the same real-life entity, as well as detecting and optimally splitting nodes that represent multiple distinct real-life entities. FONDUE does this in an entirely unsupervised fashion, relying exclusively on the topology of the network.

Abstract: Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to the same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.

Keywords: node disambiguation; node deduplication; node linking; entity linking; network embeddings; representation learning

Citation: Mel, A.; Kang, B.; Lijffijt, J.; De Bie, T. FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings. Appl. Sci. 2021, 11, 9884. https://doi.org/10.3390/app11219884

Academic Editors: Paola Velardi and Stefano Faralli

Received: 2 August 2021
Accepted: 18 October 2021
Published: 22 October 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

     Increasingly, collected data naturally comes in the form of a network of interrelated entities. Examples include social networks describing social relations between people (e.g., Facebook), citation networks describing the citation relations between papers (e.g., PubMed [1]), biological networks, such as those describing interactions between proteins (e.g., DIP [2]), and knowledge graphs describing relations between concepts or objects (e.g., DBPedia [3]). Thus, new machine learning, data mining, and information retrieval methods are increasingly targeting data in their native network representation.
     An important problem across all fields of data science, broadly speaking, is data quality. For problems on networks, especially those that are successful in exploiting fine- as well as coarse-grained structure of networks, ensuring good data quality is perhaps even more important than in standard tabular data. For example, an incorrect edge can have


                            a dramatic effect on the implicit representation of other nodes, by dramatically changing
                            distances on the network. Similarly, mistakenly representing distinct real-life entities by
                            the same node in the network may dramatically alter its structural properties, by increasing
                            the degree of the node and by merging the possibly quite distinct neighborhoods of these
                            entities into one. Conversely, representing the same real-life entity by multiple nodes can
                            also negatively affect the topology of the graph, possibly even splitting apart communities.
                                  Although identifying missing edges and, conversely, identifying incorrect edges, can
                            be tackled adequately using link prediction methods, prior work has neglected the other
                            task: identifying and appropriately splitting nodes that are ambiguous—i.e., nodes that
                            correspond to more than one real-life entity. We will refer to this task as node disambiguation
(NDA). A converse and equally important problem is that of identifying multiple nodes corresponding to the same real-life entity, a problem we will refer to as node deduplication (NDD).
     This paper proposes a unified and principled framework for both the NDA and NDD problems, called the framework for node disambiguation and deduplication using network embeddings (FONDUE). FONDUE is inspired by the empirical observation that real (natural) networks tend to be easier to embed than artificially generated (unnatural) networks, and rests on the associated hypothesis that the existence of ambiguous or duplicate nodes makes a network less natural.
                                  Although most of the existing methods tackling NDA and NDD make use of additional
                            information (e.g., node attributes, descriptions, or labels) for identifying and processing
                            these problematic nodes, FONDUE adopts a more widely applicable approach that relies
solely on topological information. Although exploiting additional information may of course increase the accuracy on those tasks, we argue that a method that does not require such information offers unique advantages, e.g., when data availability is scarce, or when building an extensive dataset on top of the graph data is not feasible for practical reasons. Additionally, this approach fits the privacy-by-design framework, as it eliminates the need to incorporate more sensitive data. Finally, we argue that, even in cases where such additional information is available, it is of both scientific and practical interest to explore how much can be achieved without using it, relying solely on the network topology. Indeed, although this is beyond the scope of the current paper, it is clear that methods that rely solely on network topology could be combined with methods that exploit additional node-level information, plausibly leading to improved performance over either type of approach individually.

                            1.1. The Node Disambiguation Problem
We address the problem of NDA in the most basic setting: given an unweighted, unlabeled, and undirected network, the task is to identify nodes that correspond to multiple distinct real-life entities. We formulate this as an inverse problem,
                            where we use the given ambiguous network (which contains ambiguous nodes) in order
                            to retrieve the unambiguous network (in which all nodes are unambiguous). Clearly, this
                            inverse problem is ill-posed, making it impossible to solve without additional information
                            (which we do not want to assume) or an inductive bias.
                                 The key insight in this paper is that such an inductive bias can be provided by the
                            network embedding (NE) literature. This literature has produced embedding-based models
                            that are capable of accurately modeling the connectivity of real-life networks down to the
                            node-level, while being unable to accurately model random networks [4,5]. Inspired by this
research, we propose to use as an inductive bias the fact that the unambiguous network must be easy to model using an NE. Thus, we introduce FONDUE-NDA, a method that identifies nodes as ambiguous if splitting them maximally improves the quality of the resulting NE.

Example 1. Figure 1a illustrates the idea of FONDUE for NDA applied to a single node. In this example, node i with embedding xi corresponds to two real-life entities that belong to two separate
                            communities, visualized by either full or dashed lines, to highlight the distinction. Because node i is
                            connected to two different communities, most NE methods would locate its embedding xi between
the embeddings of the nodes from both communities. Figure 1b shows a split of node i into nodes i′ and i″, each with connections only to one of both communities. The resulting network is easy to embed by most NE methods, with embeddings xi′ and xi″ close to their own respective communities.
                            In contrast, Figure 1c shows a split where the two resulting nodes are harder to embed. Most
                            NE methods would embed them between both communities, but substantial tension would remain,
                            resulting in a worse value of the NE objective function.

Figure 1. (a) A node that corresponds to two real-life entities that belong to two communities. Links that connect the node with the different communities are plotted in either full lines or dashed lines. (b) An ideal split that aligns well with the communities. (c) A less optimal split.
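The intuition of Example 1 can be made concrete with a small numerical sketch. The code below does not use CNE or the actual FONDUE objective (both follow in Section 3); it uses a generic, hypothetical distance-based NE model with edge probability P(i~j) = sigmoid(α − ‖xi − xj‖²), and compares the best log-likelihood attainable under a community-aligned split versus a mixed split of an ambiguous node bridging two cliques.

```python
import numpy as np

def ne_loss(A, X, alpha=2.0):
    # Negative log-likelihood of a toy distance-based NE model:
    # P(edge i~j) = sigmoid(alpha - ||x_i - x_j||^2).
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.clip(0.5 * (1 + np.tanh((alpha - D2) / 2)), 1e-9, 1 - 1e-9)
    iu = np.triu_indices(A.shape[0], k=1)
    return -(A[iu] * np.log(P[iu]) + (1 - A[iu]) * np.log(1 - P[iu])).sum()

def embed(A, d=2, steps=800, lr=0.05, restarts=3, alpha=2.0, seed=0):
    # Full-batch gradient descent on the toy NE loss; best of a few restarts.
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(restarts):
        X = rng.normal(scale=0.1, size=(A.shape[0], d))
        for _ in range(steps):
            diff = X[:, None, :] - X[None, :, :]        # x_i - x_j
            D2 = (diff ** 2).sum(-1)
            P = 0.5 * (1 + np.tanh((alpha - D2) / 2))   # stable sigmoid
            W = A - P                                   # dLoss/dD2 per pair
            np.fill_diagonal(W, 0.0)
            X -= lr * 2 * (W[:, :, None] * diff).sum(1) # gradient step
        best = min(best, ne_loss(A, X, alpha))
    return best

def clique_pair():
    # Two 5-cliques A = {0..4} and B = {5..9}; the ambiguous node is
    # split into two new nodes 10 and 11.
    A = np.zeros((12, 12))
    for grp in (range(0, 5), range(5, 10)):
        for i in grp:
            for j in grp:
                if i != j:
                    A[i, j] = 1
    return A

good = clique_pair()                 # split aligned with the communities
for j in range(0, 5):
    good[10, j] = good[j, 10] = 1    # node 10 keeps all clique-A edges
for j in range(5, 10):
    good[11, j] = good[j, 11] = 1    # node 11 keeps all clique-B edges

bad = clique_pair()                  # split mixing both communities
for j in (0, 1, 2, 5, 6):
    bad[10, j] = bad[j, 10] = 1
for j in (3, 4, 7, 8, 9):
    bad[11, j] = bad[j, 11] = 1

loss_good, loss_bad = embed(good), embed(bad)
print(loss_good, loss_bad)  # the aligned split should attain a lower loss
```

On this toy graph, the community-aligned split leaves two cleanly separable cliques, so the model fits it with markedly lower loss; FONDUE-NDA's split-scoring criterion rests on the same principle, although its actual objective and derivation are given in Section 3.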

                            1.2. The Node Deduplication Problem
The same inductive bias can also be used for the NDD problem: given an unweighted, unlabeled, and undirected network, identify distinct nodes that correspond to the same real-life entity. To this end, FONDUE-NDD determines how much merging two given nodes into one would improve the embedding quality of NE models.
                            The inductive bias considers a merge as better than another one if it results in a better value
                            of the NE objective function.
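As an illustration (a hypothetical helper, not the FONDUE implementation), merging two candidate duplicate nodes is a simple contraction of the adjacency matrix; FONDUE-NDD would then compare the NE objective value before and after such a merge:

```python
import numpy as np

def merge_nodes(A, k, l):
    # Contract nodes k and l into one node (kept at k's position).
    # The merged node inherits the union of both neighborhoods,
    # and the result has one row/column fewer than A.
    n = A.shape[0]
    keep = [i for i in range(n) if i != l]
    B = A[np.ix_(keep, keep)].copy()
    k2 = keep.index(k)
    merged = np.maximum(A[k, keep], A[l, keep])  # union of neighborhoods
    B[k2, :] = merged
    B[:, k2] = merged
    B[k2, k2] = 0                                # drop any self-loop
    return B

# Toy path graph 0-1-2-3, where nodes 1 and 2 duplicate one real entity.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(merge_nodes(A, 1, 2))
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```

A candidate merge is then scored by how much it improves the optimum of the NE objective, mirroring the split-based criterion used for NDA.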
                                 The diagram in Figure 2 shows the suggested pipeline for tackling both problems.

[Figure 2: flowchart. Data sources (structured data, documents, graph data, etc.) undergo data corruption during data collection and data processing, producing two problems: node ambiguation and node duplication. FONDUE helps identify the corrupted nodes in the graph, addressed by two tasks: node disambiguation via splitting (FONDUE-NDA) and node deduplication via contraction (FONDUE-NDD).]
Figure 2. FONDUE pipeline for both NDA and NDD. Data corruption can lead to two types of problems: node ambiguation (e.g., multiple authors sharing the same name, represented by one node in the network), in the left part of the diagram, and node duplication (e.g., one author with name variations, represented by more than one node in the network). We then define two tasks to resolve both problems separately using FONDUE.

                            1.3. Contributions
                                 In this paper, we make a number of related contributions:
                            •    We propose FONDUE, a framework exploiting the empirical observation that naturally
                                 occurring networks can be embedded well using state-of-the-art NE methods, to tackle
                                 two distinct tasks: node deduplication (FONDUE-NDD) and node disambiguation
                                 (FONDUE-NDA). The former, by identifying nodes as more likely to be duplicated
                                 if contracting them enhances the quality of an optimal NE. The latter, by identifying
                                 nodes as more likely to be ambiguous if splitting them enhances the quality of an
                                 optimal NE;
In addition to this conceptual contribution, substantial challenges had to be overcome to implement this idea in a scalable manner. Specifically, for the NDA problem, through a first-order analysis we derive a fast approximation of the expected NE quality improvement after splitting a node;
                            •    We implemented this idea for CNE [6], a recent state-of-the-art NE method, although
                                 we demonstrate that the approach can be applied for a broad class of other NE
                                 methods as well;
We tackle the NDA problem with extensive experiments over a wide range of networks, demonstrating the superiority of FONDUE over the state of the art for the identification of ambiguous nodes, at comparable computational cost;
                            •    We also empirically observe that, somewhat surprisingly, despite the increase in
                                 accuracy for identifying ambiguous nodes, no such improvement was observed for the
                                 ambiguous node splitting accuracy. Thus, for NDA, we recommend using FONDUE
                                 for the identification of ambiguous nodes, while using an existing state-of-the-art
                                 approach for optimally splitting them;
                            •    Experiments on four datasets for NDD demonstrate the viability of FONDUE-NDD
                                 for the NDD problem based on only the topological features of a network.

                            2. Related Work
                                  The problem of NDA differs from named-entity disambiguation (NED; also known
                            as named entity linking), a natural language processing (NLP) task where the purpose is to
                            identify which real-life entity from a list a named-entity in a text refers to. For example,
in the ArnetMiner dataset [7], ‘Bin Zhu’ corresponds to more than 10 distinct authors. The Open
                            Researcher and Contributor ID (ORCID) [8] was introduced to solve the author name
                            ambiguity problem, and most NED methods rely on ORCID for labeling datasets.
                                 NED in this context aims to match the author names to unique (unambiguous) author
                            identifiers [7,9–11].
In [7], hidden Markov random fields are exploited in a unified probabilistic framework to model node and edge features. On the other hand, Zhang et al. [12] designed a comprehensive framework to tackle name disambiguation using a complex feature-engineering approach: they construct paper networks and use the information shared between two papers to build a supervised model for assigning the weights of the edges of the paper network. If two nodes in the network are connected, they are more likely to be authored by the same person.
Recent approaches increasingly rely on more complex data. Ma et al. [13] used heterogeneous bibliographic network representation learning, employing relational and paper-related textual features to obtain the embeddings of multiple types of nodes, while using meta-path-based proximity measures to evaluate the neighboring and structural similarities of node embeddings in the heterogeneous graphs.
The work of Zhang et al. [9], which focuses on preserving privacy by using solely the link information in a graph, employs network embedding as an intermediate step to perform NED, but relies on other networks (person–document and document–document) in addition to the person–person network to perform the task.
                                 Although NDA could be used to assist in NED tasks, NED typically strongly relies on
                            the text, e.g., by characterizing the context in which the named entity occurs (e.g., paper
topic) [14]. Similarly, Ma et al. [15] propose a name disambiguation model based on representation learning that employs attributes and network connections: it first encodes the attributes of each paper using a variational graph auto-encoder, then computes a similarity metric from the relationship of these attributes, and finally uses graph embedding to leverage the author relationships, heavily relying on NLP.
                                  In NDA, in contrast, no natural language is considered, and the goal is to rely on just
                            the network’s connectivity in order to identify which nodes may correspond to multiple
                            distinct entities. Moreover, NDA does not assume the availability of a list of known
                            unambiguous entity identifiers, such that an important part of the challenge is to identify
which nodes are ambiguous in the first place. This offers a privacy-friendly advantage and extends the applicability to datasets where access to additional information is restricted or impossible.
     The research by Saha et al. [16] and Hermansson et al. [17] is most closely related to ours. These papers also use only the topological information of the network for NDA. Yet, Ref. [16] also requires timestamps for the edges, while [17] requires a training set of nodes labeled as ambiguous and non-ambiguous. Moreover, even though the method proposed in [16] is reportedly orders of magnitude faster than the one proposed in [17], it remains computationally substantially more demanding than FONDUE (e.g., the authors of [16] evaluate their method on networks with just 150 entities). Other recent work using NE for NED [9,18–20]
                            is only related indirectly as they rely on additional information besides the topology of
                            the network.
     The literature on NDD is scarce, as the problem is not well defined. Conceptually, it is similar to the named entity linking (NEL) problem [11,21], which aims to link instances of named entities in a text, such as newspaper articles, to the corresponding entities, often in knowledge bases (KB). Consequently, NEL heavily relies on textual data to identify erroneous entities, rather than on entity connections, which are the core of our method. KB approaches for NEL are dominant in the field [22,23], as they make use of knowledge base datasets, heavily relying on labeled and additional graph data to tackle the named entity linking task. This also poses a challenge when it comes to benchmarking our method for NDD: we identified no studies in the current literature that tackle NDD from a topological approach, at least without reliance on additional attributes and features.

                            3. Methods
                                 Section 3.1 formally defines the NDA and NDD problems. Section 3.2 introduces
                            the FONDUE framework in a maximally generic manner, independent of the specific NE
method it is applied to, or the task (NDA or NDD) it is used for. A scalable approximation
                            of FONDUE-NDA is described throughout Section 3.3, and applied to CNE as a specific
                            NE method. Section 3.4 details the FONDUE-NDD method used for NDD.
     Throughout this paper, a bold uppercase letter denotes a matrix (e.g., A), a bold lowercase letter denotes a column vector (e.g., xi), (·)^T denotes the matrix transpose (e.g., A^T), and ‖·‖ denotes the Frobenius norm of a matrix (e.g., ‖A‖).

                            3.1. Problem Definition
     We denote an undirected, unweighted, unlabeled graph as G = (V, E), with V = {1, 2, . . . , n} the set of n nodes (or vertices), and E ⊆ {{i, j} | i, j ∈ V, i ≠ j} the set of edges (or links) between these nodes. We also define the adjacency matrix of a graph G, denoted A ∈ {0, 1}^{n×n}, with Aij = 1 if {i, j} ∈ E. We denote ai ∈ {0, 1}^n as the adjacency vector for node i, i.e., the ith column of the adjacency matrix A, and Γ(i) = {j | {i, j} ∈ E} the set of neighbors of i.
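For concreteness, this notation maps directly onto code (using 0-indexed nodes here, rather than the paper's V = {1, . . . , n}); a minimal sketch:

```python
import numpy as np

# Graph G = (V, E): V = {0, ..., n-1}, E a set of unordered node pairs.
n = 4
E = {(0, 1), (0, 2), (1, 2), (2, 3)}

# Adjacency matrix A in {0,1}^{n x n}: A[i, j] = 1 iff {i, j} in E.
A = np.zeros((n, n), dtype=int)
for i, j in E:
    A[i, j] = A[j, i] = 1           # undirected, hence symmetric

a_2 = A[:, 2]                               # adjacency vector of node 2
Gamma_2 = {j for j in range(n) if A[2, j]}  # neighbor set Γ(2)
print(a_2.tolist(), sorted(Gamma_2))        # [1, 1, 0, 1] [0, 1, 3]
```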

                            3.1.1. Formalizing the Node Disambiguation Problem
                                 To formally define the NDA problem as an inverse problem, we first need to define
                            the forward problem which maps an unambiguous graph onto an ambiguous one. This
                            formalizes the ‘corruption’ process that creates ambiguity in the graph. In practice, this
                            happens most often because identifiers of the entities represented by the nodes are not
Appl. Sci. 2021, 11, 9884                                                                                                       6 of 28

                            unique. For example, in a co-authorship network, the identifiers could be non-unique
                            author names. To this end, we define a node contraction:

Definition 1 (Node Contraction). A node contraction c for a graph G = (V, E) with V = {1, 2, . . . , n} is a surjective function c : V → V̂ for some set V̂ = {1, 2, . . . , n̂} with n̂ ≤ n. For convenience, we will define c⁻¹ : V̂ → 2^V as c⁻¹(i) = {k ∈ V | c(k) = i} for any i ∈ V̂. Moreover, we will refer to the cardinality |c⁻¹(i)| as the multiplicity of the node i ∈ V̂.
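In code, a contraction is simply a surjective mapping; a small sketch with hypothetical author names (the names and mapping below are illustrative, not from the paper):

```python
# c : V -> V_hat maps each node (author identity) to, e.g., a name string.
c = {1: "B. Zhu", 2: "B. Zhu", 3: "J. Smith", 4: "A. Lee"}
V_hat = set(c.values())   # c is surjective onto its image by construction

def c_inv(i):
    # Preimage c^{-1}(i) = {k in V : c(k) = i}.
    return {k for k, v in c.items() if v == i}

def multiplicity(i):
    # |c^{-1}(i)|: number of distinct entities sharing identifier i.
    return len(c_inv(i))

print(c_inv("B. Zhu"), multiplicity("B. Zhu"))   # {1, 2} 2
```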

                                   A node contraction defines an equivalence relation ∼c over the set of nodes: i ∼c j
                            if c(i ) = c( j), and the set V̂ is the quotient set V/ ∼c . Continuing our example of a co-
                            authorship network, a node contraction maps an author onto the node representing their
                            name. Two authors i and j would be equivalent if their names c(i ) and c( j) are equal, and
                            the multiplicity of a node is the number of distinct authors with the corresponding name.
                                   We can naturally define the concept of an ambiguous graph in terms of the contraction
                            operation, as follows.

Definition 2 (Ambiguous graph). Given a graph G = (V, E) and a node contraction c for that graph, the graph Ĝ = (V̂, Ê) defined by Ê = {{c(k), c(l)} | {k, l} ∈ E} is referred to as an ambiguous graph of G. Overloading notation, we may write Ĝ := c(G). To contrast G with Ĝ, we may refer to G as the unambiguous graph.
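Applying Definition 2 in code, the edge set of the ambiguous graph is obtained by mapping each edge through c. Continuing the toy naming example (illustrative data), note how the contracted node's degree grows, merging the neighborhoods of the underlying entities:

```python
# Unambiguous graph: authors 1 and 2 are distinct people who happen to
# share the name "B. Zhu" (toy data).
E = {frozenset(p) for p in [(1, 3), (2, 4), (3, 4)]}
c = {1: "B. Zhu", 2: "B. Zhu", 3: "J. Smith", 4: "A. Lee"}

# E_hat = { {c(k), c(l)} : {k, l} in E }   (Definition 2)
E_hat = {frozenset({c[k], c[l]}) for k, l in E}
print(sorted(tuple(sorted(e)) for e in E_hat))
# [('A. Lee', 'B. Zhu'), ('A. Lee', 'J. Smith'), ('B. Zhu', 'J. Smith')]
# "B. Zhu" now has degree 2, although each underlying author had degree 1.
```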

     Continuing the example of the co-authorship network, the contraction operation can be thought of as the operation that replaces author identities with their names, which may map distinct authors onto the same shared name. Note that the symbols for the ambiguous
                            graph and its set of nodes and edges are denoted here using hats, to indicate that in the
                            NDA problem we are interested in situations where the ambiguous graph is the empirically
                            observed graph.
                                 We can now formally define the NDA problem as inverting this contraction operation:

                            Definition 3 (The Node Disambiguation Problem). Given an ambiguous graph Ĝ = (V̂, Ê),
                            NDA aims to retrieve the unambiguous graph G = (V, E) and associated node contraction c, i.e., a
                            contraction c for which c(G) = Ĝ .

                                 To be more precise, it suffices to identify G up to an isomorphism, as the actual
                            identifiers of the nodes are irrelevant.

                            3.1.2. Formalizing the Node Deduplication Problem
                                 The NDD problem can be formalized as the converse of the NDA problem, also relying
                            on the concept of node contractions. First, a duplicate graph can be defined as follows:

                            Definition 4 (Duplicate graph). Given a graph G = (V, E), a graph Ĝ = (V̂, Ê) where {k, l } ∈
                            Ê ⇒ {c(k), c(l )} ∈ E for an appropriate contraction c, and where for each {i, j} ∈ E there exists
                            an edge {k, l } ∈ Ê for which c(k) = i and c(l ) = j, is referred to as a duplicate graph of G . Or
                            more concisely, using the overloaded notation from Definition 2, a duplicate graph Ĝ is a graph for
                            which c(Ĝ) = G . To contrast G with Ĝ , we may refer to G as the deduplicated graph.
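Definition 4 can be checked mechanically: every edge of Ĝ must contract onto an edge of G, and every edge of G must have at least one preimage; equivalently, c(Ĝ) = G. A small sketch with hypothetical name-variant data:

```python
def is_duplicate_graph(E_hat, c, E):
    # G_hat is a duplicate graph of G iff contracting its edges by c
    # reproduces exactly the edge set of G, i.e., c(G_hat) = G.
    contracted = {frozenset({c[k], c[l]}) for k, l in E_hat}
    return contracted == set(E)

# Toy: one author appears under two name variants "1a" and "1b".
E = {frozenset(p) for p in [(1, 2), (1, 3)]}            # deduplicated graph
E_hat = {frozenset(p) for p in [("1a", 2), ("1b", 3)]}  # observed graph
c = {"1a": 1, "1b": 1, 2: 2, 3: 3}
print(is_duplicate_graph(E_hat, c, E))   # True
```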

     Continuing the example of the co-authorship network, the same author could appear under two versions of their name, such that they are assigned two different nodes in the duplicate graph. A contraction operation that maps duplicate names to their common identity would merge such nodes corresponding to the same author. Hats on top of the symbols of the duplicate graph indicate that in the NDD problem we are interested in the situation where the duplicate graph is the empirically observed one.
                                 The NDD problem can, thus, be formally defined as follows:
Appl. Sci. 2021, 11, 9884                                                                                               7 of 28

                            Definition 5 (The Node Deduplication Problem). Given a duplicate graph Ĝ = (V̂, Ê), NDD
                            aims to retrieve the deduplicated graph G = (V, E) and the node contraction c associated with Ĝ ,
                            i.e., for which G = c(Ĝ).

                            3.1.3. Real Graphs Suffer from Both Issues
                                  Of course, many real graphs require both deduplication and disambiguation. This is
                             particularly true for the running example of the co-authorship network. Yet, while building
                             on the common FONDUE framework, we define and study both problems separately, and
                             propose an algorithm for each in Section 3.3 (for NDA) and Section 3.4 (for NDD). For
                             networks suffering from both problems, both algorithms can be applied concurrently or
                             sequentially without difficulty, thus solving both problems simultaneously.

                            3.2. FONDUE as a Generic Approach
                                  To address both the NDA and NDD problems, FONDUE uses the inductive bias that
                             the non-corrupted (unambiguous and deduplicated) network must be easy to model using
                             NE. This allows us to approach both problems in the context of NE. Here, we first formalize
                             the inductive bias of FONDUE (Section 3.2). This will later allow us to present both the
                             FONDUE-NDA (Section 3.3) and FONDUE-NDD (Section 3.4) algorithms, each tackling
                             one of the data corruption problems (NDA and NDD, respectively).

                             The FONDUE Inductive Bias
                                 Clearly, both the NDA and NDD problems are inverse problems, with NDA an ill-
                            posed one. Thus, further assumptions, inductive bias, or priors are inevitable in order to
                            solve them. The key hypothesis in FONDUE is that the unambiguous and deduplicated G ,
                            considering it is a ‘natural’ graph, can be embedded well using state-of-the-art NE methods.
                            This hypothesis is inspired by the empirical observation that NE methods embed ‘natural’
                            graphs well.
                                  NE methods find a mapping f : V → Rd from nodes to d-dimensional real vectors.
                             An embedding is denoted as X = (x1 , x2 , . . . , xn )⊤ ∈ Rn×d , where xi ≜ f (i ) for i ∈ V is the
                             embedding of node i. Most well-known NE methods aim to find an optimal embedding
                             X*G for a given graph G that minimizes a continuous differentiable cost function O(G , X ).
                                 Thus, given an ambiguous graph Ĝ , FONDUE-NDA will search for the graph G , such
                            that c(G) = Ĝ for an appropriate contraction c, while optimizing the NE cost function on G :

                            Definition 6 (NE-based NDA problem). Given an ambiguous graph Ĝ , NE-based NDA aims
                            to retrieve the unambiguous graph G and the associated contraction c:

                                                       argmin_G  O(G , X*G)
                                                          s.t. c(G) = Ĝ for some contraction c.                   (1)

                                   Ideally, this optimization problem can be solved by simultaneously finding optimal
                             splits for all nodes (i.e., an inverse of the contraction c) that yield the smallest embedding
                             cost after re-embedding. However, this strategy requires (a) searching an exponential
                             space containing the combinations of splits (of arbitrary cardinality) of all nodes, and
                             (b) recomputing the embedding of the resulting network in order to evaluate each such
                             combination of splits. Thus, this ideal solution is computationally intractable and more
                             scalable solutions are needed (see Section 3.3).
                                  Similarly, for NDD, given a duplicate graph Ĝ , FONDUE-NDD will search for a graph
                            G , such that c(Ĝ) = G for an appropriate contraction c, again while optimizing the NE cost
                            function on G :

                            Definition 7 (NE-based NDD problem). Given a duplicate graph Ĝ , NE-based NDD aims to
                            retrieve the deduplicated graph G and the associated contraction c of Ĝ :

                                                       argmin_G  O(G , X*G)
                                                          s.t. c(Ĝ) = G for some contraction c.                   (2)

                                  Generally speaking, to solve this optimization problem, we would want to find the
                             optimal merging of all nodes that reduces the embedding cost after re-embedding. Yet, a
                             thorough optimization of this problem is beyond the scope of this paper; as an approxima-
                             tion, we rely on a ranking-based approach in which we rank networks with randomly
                             merged nodes by the value of the objective function after re-embedding. This may be
                             suboptimal, but it demonstrates the viability of the concept when used for NDD, as shown
                             in the experimental results.
                                  Although the principle underlying both methods is thus very similar, we will see
                             below that the corresponding methods differ considerably. What they have in common is
                             the need for a basic understanding of NE methods.

                            3.3. FONDUE-NDA
                                From the above section, it is clear that the NDA problem can be decomposed into
                            two subproblems:
                             1.    Estimating the multiplicities of all i ∈ V̂, i.e., the number of unambiguous nodes
                                   from G represented by the node from Ĝ . This essentially amounts to estimating the
                                   contraction c. Note that the number of nodes n in V is then equal to the sum of these
                                   multiplicities, and arbitrarily assigning these n nodes to the sets c−1 (i ) defines c−1
                                   and, thus, c;
                            2.    Given c, estimating the edge set E. To ensure that c(G) = Ĝ , for each {i, j} ∈ Ê there
                                  must exist at least one edge {k, l } ∈ E with k ∈ c−1 (i ) and l ∈ c−1 ( j). However, this
                                  leaves the problem underdetermined (making this problem ill-posed), as there may
                                  also exist multiple such edges.
                                   As an inductive bias for the second step, we will additionally assume that the graph
                              G is sparse. Thus, FONDUE-NDA estimates G as the graph with the smallest set E for
                             which c(G) = Ĝ . Practically, this means that an edge {i, j} ∈ Ê results in exactly one edge
                              {k, l } ∈ E with k ∈ c−1 (i ) and l ∈ c−1 ( j), and that equivalent nodes k ∼c l with k, l ∈ V
                              are never connected by an edge, i.e., {k, l } ∉ E. This bias reflects the sparsity of most
                             ‘natural’ graphs, and our experiments confirm that it is effective in practice.
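Under this sparsity bias, once a candidate split of a node is fixed, the minimal edge set E follows directly: each edge of Ĝ is assigned to exactly one copy of the split node, and copies are never connected to each other. A sketch under these assumptions; the `sparse_inverse` helper, the split encoding, and the toy graph are illustrative, not part of FONDUE's implementation:

```python
# Sketch: reconstruct the smallest edge set E consistent with c(G) = G_hat
# for a chosen split of one ambiguous node (hypothetical toy data).

def sparse_inverse(E_hat, split):
    """Each edge of G_hat yields exactly one edge of G; copies of the same
    node (here "a1", "a2") are never connected to each other."""
    E = set()
    for i, j in E_hat:
        ki = split.get(i, {}).get(j, i)  # copy of i that keeps the edge to j
        kj = split.get(j, {}).get(i, j)  # copy of j that keeps the edge to i
        E.add(frozenset((ki, kj)))
    return E

# Ambiguous node "a" merges two authors; its edges are divided over a1, a2.
E_hat = [("a", 1), ("a", 2), ("a", 3), ("a", 4), (1, 2), (3, 4)]
split = {"a": {1: "a1", 2: "a1", 3: "a2", 4: "a2"}}
E = sparse_inverse(E_hat, split)
```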
                                    We approach the NE-based NDA Problem 6 in a greedy and iterative manner. In each
                              iteration, FONDUE-NDA identifies the node whose best split results in the smallest value
                             of the cost function among all nodes. To further reduce the computational complexity,
                              FONDUE-NDA only splits one node into two nodes at a time (e.g., Figure 1b), i.e., it splits
                              node i into two nodes i′ and i″ with corresponding adjacency vectors ai′ , ai″ ∈ {0, 1}n ,
                              ai′ + ai″ = ai . We refer to such a split as a binary split. Note that repeated binary splits
                              can of course be used to achieve the same result as a single split into several nodes, so this
                              assumption does not imply a loss of generality or applicability. Once the best binary split
                              of the best node is identified, FONDUE-NDA splits that node and starts the next iteration.
                             The evaluation of each split requires recomputing the embedding and comparing the
                              resulting optimal NE cost function values with each other.
                                    Unfortunately, this naive strategy is computationally intractable: computing a single
                              NE is already computationally demanding for most (if not all) NE methods. Thus, having
                              to compute a re-embedding for all possible splits, even binary ones (there are O(n2^d ) of
                              them, with n the number of nodes and d the maximal degree), is entirely infeasible for
                             practical networks.

                            3.3.1. A First-Order Approximation for Computational Tractability
                                   Thus, instead of recomputing the embedding, FONDUE-NDA performs a first-order
                            analysis by investigating the effect of an infinitesimal split of a node i around its embedding
                            xi , on the cost O(Ĝsi , X̂si ) obtained after performing the splitting, with Ĝsi and X̂si referring
                            to the ambiguous graph and its embeddings’ representation, respectively, after splitting
                            node i.
                                    Drawing intuition from Figure 1, when two distinct authors share the same name
                             in a given collaboration network, their respective separate communities (ego-networks)
                             are lumped into one big cluster. Yet, from a topological point of view, that ambiguous
                             node (author name) is connected to two communities that are generally different, meaning
                             they share very few, if any, links. This stems from the observation that it is highly unlikely
                             that two authors with the exact same name would belong to the same community, i.e.,
                             collaborate together. Splitting this ambiguous node into two different ones (distinguishing
                             the two authors) would ideally separate these two communities. Intuitively, each com-
                             munity that should be embedded separately pulls the ambiguous node towards its own
                             embedding region, and once separated, the embeddings of each of the resolved nodes will
                             improve. Our main goal is therefore to quantify the improvement in the embedding cost
                             function obtained by separating the two nodes i′ and i″ by a unit distance in a certain
                             direction. We propose to split the assignment of the edges of i between i′ and i″, such that
                             all the links of i are distributed over i′ and i″ in such a way as to maximize this improve-
                             ment, which can be evaluated by computing the gradient with respect to the separation
                             distance δi .
                                    Specifically, FONDUE-NDA seeks the split of node i that will result in embeddings
                             xi′ and xi″ with infinitesimal difference δ i (where δ i = xi′ − xi″ , xi′ = xi + δ i /2, xi″ = xi − δ i /2,
                             and δ i → 0, e.g., Figure 1b), such that ||∇δ i O(Ĝsi , X̂si )|| is large, with ∇δ i O(Ĝsi , X̂si ) being
                             the gradient of O(Ĝsi , X̂si ) with respect to δ i . This can be computed analytically. Indeed,
                             applying the chain rule, we find:

                                                  ∇δ i O(Ĝsi , X̂si ) = ½ ∇xi′ O(Ĝsi , X̂si ) − ½ ∇xi″ O(Ĝsi , X̂si ).                       (3)
                                  Many recent NE methods, like LINE [24] and CNE [6], aim to embed ‘similar’ nodes in
                             the graph closer to each other, and ‘dissimilar’ nodes further away from each other (for a
                            particular similarity notion depending on the NE method). For such methods, Equation (3)
                            can be further simplified. Indeed, as such NE methods focus on modeling a property of
                            pairs of nodes (their similarity), their objective functions can be typically decomposed as
                            a summation of node-pair interaction losses over all node-pairs. For example, this can
                            be seen in Section 3.3.3 of the current paper for CNE [6], and in Equations (3) and (6)
                            of [24] for LINE. Each of these node-pair interaction losses quantifies the extent to which
                            the proximity between nodes’ embeddings reflects their ‘similarity’ in the network. For
                            methods where this decomposition is possible, we can thus write the objective function
                            as follows:

                                           O(G , X ) = ∑_{{i,j}∈V×V} Oᵖ( Aij , xi , x j )
                                                     = ∑_{{i,j}∈E} Oᵖ( Aij = 1, xi , x j ) + ∑_{{k,l}∉E} Oᵖ( Akl = 0, xk , xl ),

                             where Oᵖ( Aij , xi , x j ) denotes the node-pair interaction loss for the nodes i and j, Oᵖ( Aij =
                             1, xi , x j ) the part of the objective function that corresponds to nodes i and j with an edge
                             between them (Aij = 1), and Oᵖ( Akl = 0, xk , xl ) the part of the objective function where
                             nodes k and l are disconnected.

                                  Given that Γ(i ) = Γ(i′) ∪ Γ(i″) and Γ(i′) ∩ Γ(i″) = ∅, we can apply the same
                             decomposition approach on ∇xi′ O(Ĝsi , X̂si ):

                                ∇xi′ O(Ĝsi , X̂si ) = ∇xi′ ∑_{j∈Γ(i′)} Oᵖ( Ai′j = 1, xi′ , x j ) + ∇xi′ ∑_{l∉Γ(i′)} Oᵖ( Ai′l = 0, xi′ , xl )
                                                    = ∇xi′ ∑_{j∈Γ(i′)} Oᵖ( Ai′j = 1, xi′ , x j ) + ∇xi′ ∑_{l∈Γ(i″)} Oᵖ( Ai′l = 0, xi′ , xl )
                                                      + ∇xi′ ∑_{l∉Γ(i)} Oᵖ( Ai′l = 0, xi′ , xl ),

                             and similarly on ∇xi″ O(Ĝsi , X̂si ):

                                ∇xi″ O(Ĝsi , X̂si ) = ∇xi″ ∑_{j∈Γ(i″)} Oᵖ( Ai″j = 1, xi″ , x j ) + ∇xi″ ∑_{l∈Γ(i′)} Oᵖ( Ai″l = 0, xi″ , xl )
                                                      + ∇xi″ ∑_{l∉Γ(i)} Oᵖ( Ai″l = 0, xi″ , xl ).

                                Additionally, as both nodes i0 and i00 share the same set of non-neighbors of node i,
                            we can write the following:

                                      ∇xi′ ∑_{l∉Γ(i)} Oᵖ( Ai′l = 0, xi′ , xl ) = ∇xi″ ∑_{l∉Γ(i)} Oᵖ( Ai″l = 0, xi″ , xl ).

                                 Furthermore, incorporating the previous two decompositions, we can rewrite
                             Equation (3) as follows:

                                ∇δ i O(Ĝsi , X̂si ) = ½ ( ∑_{j∈Γ(i′)} ∇xi′ Oᵖ( Ai′j = 1, xi′ , x j ) − ∑_{l∈Γ(i′)} ∇xi″ Oᵖ( Ai″l = 0, xi″ , xl )
                                                        + ∑_{l∈Γ(i″)} ∇xi′ Oᵖ( Ai′l = 0, xi′ , xl ) − ∑_{j∈Γ(i″)} ∇xi″ Oᵖ( Ai″j = 1, xi″ , x j ) )
                                                   = ½ ∑_{j∈Γ(i′)} ( ∇xi′ Oᵖ( Ai′j = 1, xi′ , x j ) − ∇xi″ Oᵖ( Ai″j = 0, xi″ , x j ) )
                                                   − ½ ∑_{l∈Γ(i″)} ( ∇xi″ Oᵖ( Ai″l = 1, xi″ , xl ) − ∇xi′ Oᵖ( Ai′l = 0, xi′ , xl ) ).

                                 Given that Γ(i′)∁ = Γ(i ) − Γ(i′) = Γ(i″) and Γ(i″)∁ = Γ(i ) − Γ(i″) = Γ(i′), the above
                             equation can be simplified to:

                                ∇δ i O(Ĝsi , X̂si ) = ½ ∑_{j∈Γ(i)} (−1)^m ( ∇xi Oᵖ( Aij = 1, xi , x j ) − ∇xi Oᵖ( Aij = 0, xi , x j ) ),        (4)

                             with m = 0 if j ∈ Γ(i′) and m = 1 if j ∈ Γ(i″).
                                   Let Fi¹ ∈ R^{d×|Γ(i)|} be a matrix whose columns are the gradient vectors
                             ∇xi Oᵖ( Aij = 1, xi , x j ) (one column for each j ∈ Γ(i )), and let Fi⁰ ∈ R^{d×|Γ(i)|} be the matrix
                             whose columns are the gradient vectors ∇xi Oᵖ( Aij = 0, xi , x j ) (also one column for each
                             j ∈ Γ(i )). Moreover, let bi ∈ {1, −1}^{|Γ(i)|} be a vector with a dimension corresponding
                             to each of the neighbors j ∈ Γ(i ) of i, with value equal to 1 if that neighbor is a neighbor
                             of i′ and equal to −1 if it is a neighbor of i″ after splitting i. Then the gradient in
                             Equation (4) can be written more concisely and transparently as follows:

                                                                 ∇δ i O(Ĝsi , X̂si ) = ½ ( Fi¹ − Fi⁰ ) bi .

                                 The aim of FONDUE-NDA is to identify node splits for which the embedding quality
                             improvement is maximal. As argued above, we propose to approximately quantify this
                             by means of a first-order approximation, by considering the two-norm squared of this
                             gradient, namely by maximizing ‖∇δ i O(Ĝsi , X̂si )‖². Denoting Mi = ( Fi¹ − Fi⁰ )⊤( Fi¹ − Fi⁰ )
                             (and recognizing that bi⊤bi = |Γ(i )| is independent of bi ), FONDUE-NDA can thus be
                             formalized in the following compact form:

                                                     argmax_{i, bi∈{1,−1}^{|Γ(i)|}}  ( bi⊤ Mi bi ) / ( bi⊤ bi ).                            (5)

                                 Note that Mi ⪰ 0 for all nodes and all splits, such that this is an instance of a Boolean
                             quadratic maximization problem [25,26]. This problem is NP-hard; thus, it requires further
                             approximations to ensure tractability in practice.
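For very small neighborhoods, the problem in Equation (5) can nonetheless be solved exactly by enumerating all sign vectors, which is useful as a reference point for the heuristics of the next subsection. A sketch on a randomly generated positive semidefinite matrix standing in for Mi; this is not feasible at scale:

```python
# Exact maximization of (b^T M b)/(b^T b) over b in {-1, +1}^m by enumeration.
# Only feasible for tiny neighbourhoods; the PSD matrix M is a random stand-in.
import itertools
import numpy as np

def best_split_bruteforce(M):
    m = M.shape[0]
    best_val, best_b = -np.inf, None
    # b and -b give the same objective, so fix b[0] = +1 (2^(m-1) patterns).
    for signs in itertools.product((-1.0, 1.0), repeat=m - 1):
        b = np.array((1.0,) + signs)
        val = b @ M @ b / m           # b^T b = m for sign vectors
        if val > best_val:
            best_val, best_b = val, b
    return best_val, best_b

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 6))       # stand-in for F_i^1 - F_i^0
M = F.T @ F                           # PSD by construction, like M_i
val, b_opt = best_split_bruteforce(M)
```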

                             3.3.2. Additional Heuristics for Enhanced Scalability
                                   In order to efficiently search for the best split of a given node, we developed two
                             approximation heuristics.
                                   First, we randomly split the neighborhood Γ(i ) in two and evaluate the objective
                             (Equation (5)). We repeat this randomization procedure a fixed number of times and pick
                             the split that gives the best objective value.
                                   Second, we find the eigenvector v that corresponds to the largest absolute eigenvalue
                             of the matrix Mi . We sort the elements of v, assign the top-k corresponding nodes to
                             Γ(i′) and the rest to Γ(i″), evaluate the objective value for k = 1 . . . |Γ(i )|, and pick the
                             best split.
                                   Finally, we combine these two heuristics and use the split that gives the best objective
                             value (Equation (5)) as the final split of node i.
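The two heuristics above can be sketched as follows; this is a simplified stand-alone version, with the trial count and the randomly generated stand-in for Mi chosen purely for illustration:

```python
# Sketch of the two split heuristics on a random PSD stand-in for M_i.
import numpy as np

def random_split_heuristic(M, n_trials=200, seed=0):
    """Heuristic 1: score random sign vectors b and keep the best."""
    rng = np.random.default_rng(seed)
    m = M.shape[0]
    best_val, best_b = -np.inf, None
    for _ in range(n_trials):
        b = rng.choice(np.array([-1.0, 1.0]), size=m)
        val = b @ M @ b / m
        if val > best_val:
            best_val, best_b = val, b
    return best_val, best_b

def eigenvector_split_heuristic(M):
    """Heuristic 2: sort nodes by the top eigenvector of M_i and sweep the
    threshold k, assigning the top-k entries to Gamma(i')."""
    m = M.shape[0]
    eigvals, eigvecs = np.linalg.eigh(M)
    v = eigvecs[:, np.argmax(np.abs(eigvals))]
    order = np.argsort(-v)                 # indices sorted by decreasing v
    best_val, best_b = -np.inf, None
    for k in range(1, m):                  # both sides of the split nonempty
        b = -np.ones(m)
        b[order[:k]] = 1.0
        val = b @ M @ b / m
        if val > best_val:
            best_val, best_b = val, b
    return best_val, best_b

rng = np.random.default_rng(1)
F = rng.standard_normal((2, 8))
M = F.T @ F
v1, b1 = random_split_heuristic(M)
v2, b2 = eigenvector_split_heuristic(M)
best_val = max(v1, v2)                     # the combined heuristic keeps the best
```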

                            3.3.3. FONDUE-NDA Using CNE
                                We now apply FONDUE-NDA to conditional network embedding (CNE). CNE pro-
                            poses a probability distribution for network embedding and finds a locally optimal embed-
                            ding by maximum likelihood estimation. CNE has objective function:

                                    O(G , X ) = log( P( A| X ))
                                              = ∑_{i,j: Aij=1} log Pij ( Aij = 1| X ) + ∑_{i,j: Aij=0} log Pij ( Aij = 0| X ).       (6)

                                 Here, the link probabilities Pij conditioned on the embedding are defined as follows:

                                  Pij ( Aij = 1| X ) = PA,ij N+,σ1 (‖xi − x j ‖) / ( PA,ij N+,σ1 (‖xi − x j ‖) + (1 − PA,ij ) N+,σ2 (‖xi − x j ‖) ),

                             where N+,σ denotes a half-normal distribution [27] with spread parameter σ, σ2 > σ1 = 1,
                             and where PA,ij is a prior probability for a link to exist between nodes i and j, as inferred
                             from the degrees of the nodes (or based on other information about the structure of the
                             network [28]). First, we derive the gradient:

                                            ∇xi O(G , X ) = γ ∑_{j≠i} (xi − x j ) ( P( Aij = 1| X ) − Aij ),

                             where γ = 1/σ1² − 1/σ2². This allows us to further compute the gradient

                                            ∇δ i O(Ĝsi , X̂si ) = −(γ/2) [ · · ·  xi − x j  · · · ] bi ,

                             where the matrix between brackets contains one column xi − x j for each neighbor j ∈ Γ(i ).

                                  Thus, the Boolean quadratic maximization problem takes the form:

                                                     argmax_{i, bi∈{1,−1}^{|Γ(i)|}}  ( bi⊤ Mi bi ) / ( bi⊤ bi ),                            (7)

                             with ( Mi )kl = (xi − xk )⊤(xi − xl ) for k, l ∈ Γ(i ), where the constant factor γ² has been
                             dropped as it does not affect the maximizer.

                            3.4. FONDUE-NDD
                                   Using the inductive bias for the NDD problem, the goal is to minimize the embedding
                             cost after merging the duplicate nodes in the graph (Equation (2)). This is motivated by
                             the fact that natural networks tend to be modeled better by NE methods than corrupted
                             (duplicate) networks, and thus their embedding cost should be lower. Merging (or
                             contracting) duplicate nodes (nodes that refer to the same entity) in a duplicate graph Ĝ
                             would therefore result in a contracted graph Ĝc that is less corrupt (resembling more
                             closely a “natural” graph) and thus has a lower embedding cost.
                                   Contrary to NDA, NDD is more straightforward, as it does not deal with the problem
                             of reassigning the edges of a node after splitting, but rather simply with determining the
                             duplicate nodes in a duplicate graph. FONDUE-NDD, applied on Ĝ , aims to find duplicate
                             node-pairs in the graph and combine them into one node by reassigning the union of their
                             edges, which results in the contracted graph Ĝc .
                                  Using NE methods, FONDUE-NDD aims to iteratively identify a node-pair {i, j} ∈
                             V̂cand , where V̂cand is the set of all candidate node-pairs, that, if merged together to form
                             one node im , would result in the smallest cost function value among all the node-pairs.
                             Thus, Problem 7 can be further rewritten as:

                                                          argmin_{{i,j}∈V̂cand}  O( Ĝcij , X̂cij ),                              (8)

                            where Ĝcij is a contracted graph from Ĝ after merging the node-pair {i, j} , and X̂cij its
                            respective embeddings.
     Trying this for all possible node-pairs in the graph is intractable, and it is not obvious what information could be used to approximate Equation (8). We therefore approach the problem by randomly selecting node-pairs, merging them, observing the resulting cost function values, and ranking the results: the lower the cost, the more likely the merged nodes are duplicates.
     Lacking a scalable bottom-up procedure to identify the best node-pairs, our experiments focus on evaluating whether the introduced merging criterion is indeed useful for identifying node-pairs that appear to be duplicates.
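A minimal sketch of this merge-and-rank procedure, on an undirected graph stored as a dict of neighbor sets. All names here are ours, not from the FONDUE repository, and `embedding_cost` is a stand-in for the NE objective O(·, ·):

```python
import random

def merge_pair(adj, i, j):
    """Contract nodes i and j of an undirected graph (dict of neighbor sets):
    node i inherits the union of both edge sets, node j is removed."""
    merged = {u: set(nbrs) for u, nbrs in adj.items() if u != j}
    merged[i] = (adj[i] | adj[j]) - {i, j}   # union of edges, no self-loops
    for u in adj[j] - {i}:
        merged[u].discard(j)
        merged[u].add(i)
    return merged

def rank_candidate_pairs(adj, candidates, embedding_cost, n_samples=100, seed=0):
    """Randomly sample candidate node-pairs, merge each, and rank by the
    resulting embedding cost (lower cost => more likely duplicates)."""
    rng = random.Random(seed)
    sampled = rng.sample(list(candidates), min(n_samples, len(candidates)))
    scored = [((i, j), embedding_cost(merge_pair(adj, i, j))) for i, j in sampled]
    return sorted(scored, key=lambda t: t[1])  # lowest cost first
```

In practice, `embedding_cost` would re-embed the contracted graph with CNE and return the value of its objective function.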

                            FONDUE-NDD Using CNE
     As in the previous section, we apply CNE as the network embedding method; the objective function of FONDUE-NDD is thus that of CNE, evaluated on the tentatively deduplicated graph after attempting a merge:

$$O(\hat{G}_{c_{ij}}, \hat{X}_{c_{ij}}) = -\log P(\hat{A}_{c_{ij}} \mid \hat{X}_{c_{ij}}) = \sum_{k,l:\,\hat{A}_{c_{ij},kl}=1} \log\left(1 + \frac{\sigma_1(1 - P_{kl})}{\sigma_2 P_{kl}} \exp\left(\Big(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\Big)\frac{d_{kl}^2}{2}\right)\right) + \sum_{k,l:\,\hat{A}_{c_{ij},kl}=0} \log\left(1 + \frac{\sigma_2 P_{kl}}{\sigma_1(1 - P_{kl})} \exp\left(\Big(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\Big)\frac{d_{kl}^2}{2}\right)\right), \tag{9}$$

where the link probabilities conditioned on the embedding are defined as:

$$P_{kl}(\hat{A}_{c_{ij},kl} = 1 \mid X) = \frac{P_{\hat{A}_{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\lVert x_k - x_l\rVert)}{P_{\hat{A}_{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\lVert x_k - x_l\rVert) + \big(1 - P_{\hat{A}_{c_{ij}},kl}\big)\, \mathcal{N}_{+,\sigma_2}(\lVert x_k - x_l\rVert)}.$$
Appl. Sci. 2021, 11, 9884                                                                                                     13 of 28

     Similarly to Section 3.3.3, $\mathcal{N}_{+,\sigma}$ denotes a half-Normal distribution with spread parameter σ, σ2 > σ1 = 1, and $P_{\hat{A}_{c_{ij}},kl}$ is a prior probability for a link to exist between nodes k and l, as inferred from the network properties.
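To make the roles of the prior and the two half-Normal distributions concrete, here is a small numeric sketch. The function names are ours, `prior_kl` stands in for the prior $P_{\hat{A}_{c_{ij}},kl}$, and σ1 = 1, σ2 = 2 match the embedding configuration used in the experiments below:

```python
import math

def halfnormal_pdf(d, sigma):
    """Density of a half-Normal with spread sigma, evaluated at distance d >= 0."""
    return math.sqrt(2.0 / math.pi) / sigma * math.exp(-d * d / (2.0 * sigma * sigma))

def link_probability(d_kl, prior_kl, sigma1=1.0, sigma2=2.0):
    """CNE posterior probability of a link between nodes k and l, given their
    embedding distance d_kl = ||x_k - x_l|| and a structural prior prior_kl."""
    num = prior_kl * halfnormal_pdf(d_kl, sigma1)
    den = num + (1.0 - prior_kl) * halfnormal_pdf(d_kl, sigma2)
    return num / den
```

With σ1 < σ2, small embedding distances push the link probability above the prior, while large distances pull it below.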

                            4. Experiments
                                 In this section, we investigate quantitatively and qualitatively the performance of
                            FONDUE on both semi-synthetic and real-world datasets, compared to state-of-the-art
                            methods tackling the same problems. In Section 4.1, we introduce and discuss the different
                            datasets used in our experiments, in Section 4.2 we discuss the performance of FONDUE-
                            NDA, and FONDUE-NDD in Section 4.3. Finally, in Section 4.4, we summarize and discuss
                            the results. All code used in this section is publicly available from the GitHub repository
                            https://github.com/aida-ugent/fondue, accessed on 20 October 2021.

                            4.1. Datasets
     One main challenge in evaluating disambiguation tasks is the scarcity of ambiguous (contracted) graph datasets with reliable ground truth. Furthermore, other studies that focus on identifying ambiguous nodes often do not publish their heavily processed datasets (e.g., the DBLP datasets of [16]), which makes benchmarking different methods harder. Thus, to simulate data corruption in real-world datasets, we opted to create a contracted graph from a given source graph, and then use the latter as ground truth to assess the accuracy of FONDUE compared to other baselines. To do so, we used a simple approach for node contraction, for both NDA (Section 4.2.1) and NDD (Section 4.3.1). Table 1 lists the details of the different datasets used in our experiments after post-processing.
     Additionally, we use real-world networks containing ambiguous and duplicate nodes, mainly part of the PubMed collaboration network analyzed in Appendix A. The PubMed data are released in independent issues, so to build a connected network from the PubMed data, we select issues that contain ambiguous and duplicate nodes and then take the largest connected component of the resulting network. One main limitation of this dataset is that not every author has an associated ORCID iD, which affects the false-positive and false-negative labels in the network (author names that might be ambiguous would be ignored). This is further highlighted in the subsequent sections.

                            4.2. Node Disambiguation
     In this section, we investigate the following questions: (Q1) Quantitatively, how does our method perform in identifying ambiguous nodes compared to the state of the art and other heuristics? (Section 4.2.2); (Q2) Qualitatively, how reliable is the quality of the detected ambiguous nodes compared to other methods when applied to real-world datasets? (Section 4.2.3); (Q3) Quantitatively, how does our method perform in terms of splitting the ambiguous nodes? (Section 4.2.4); (Q4) How does the behavior of the method change when the degree of contraction of a network varies? (Section 4.2.5); (Q5) Does the proposed method scale? (Section 4.2.6); (Q6) Quantitatively, how does our method perform in terms of node deduplication? (Section 4.3.1).

                            4.2.1. Data Processing
     Before conducting the experiments, the data had to be processed to generate semi-synthetic networks. This was done by contracting each of the thirteen datasets in Table 2. More specifically, for each network G = (V, E), a graph contraction was performed to create a contracted (ambiguous) graph Ĝ = (V̂, Ê) by randomly merging a fraction r of the total number of nodes, providing a ground truth against which to test our proposed method. This is done by first specifying the fraction of the nodes in the graph to be contracted (r ∈ {0.001, 0.01, 0.1}), and then sampling two sets of vertices, V̂i ⊂ V̂ and V̂j ⊂ V̂, such that |V̂i| = |V̂j| = ⌊r · |V̂|⌋ and V̂i ∩ V̂j = ∅. Then, every vertex vj ∈ V̂j is merged with the corresponding vertex vi ∈ V̂i by reassigning the links connected to vj to vi and removing vj from the network. The node-pairs (vi, vj) later serve as ground truth. We also tested the case where the candidate contracted vertices have no common neighbors (instead of uniform random selection). This mimics some types of social networks: for two authors who share the same name, their ego-networks often do not intersect. Further analysis of the PubMed dataset (Table 1) revealed that none of the ambiguous nodes shared edges with the neighbors of another ambiguous node.
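The contraction procedure above can be sketched as follows. This is an illustrative reimplementation on a dict-of-sets graph representation, not the actual FONDUE preprocessing code, and `contract_graph` is our own name:

```python
import math
import random

def contract_graph(adj, r, seed=0):
    """Create a contracted (ambiguous) graph by merging a fraction r of nodes.

    adj is an undirected graph as a dict of neighbor sets. Two disjoint vertex
    sets of size floor(r * |V|) are sampled; every v_j is merged into its
    paired v_i by reassigning v_j's links to v_i and removing v_j. The merged
    pairs are returned as ground truth."""
    rng = random.Random(seed)
    n = math.floor(r * len(adj))
    sample = rng.sample(list(adj), 2 * n)          # 2n distinct vertices
    pairs = list(zip(sample[:n], sample[n:]))      # (v_i, v_j), disjoint sets
    contracted = {u: set(nbrs) for u, nbrs in adj.items()}
    for vi, vj in pairs:
        for u in contracted.pop(vj):
            contracted[u].discard(vj)              # drop references to v_j
            if u != vi:                            # no self-loops
                contracted[u].add(vi)
                contracted[vi].add(u)
    return contracted, pairs
```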
                                               We have tested the performance of FONDUE-NDA, as well as that of the competing
                                        methods listed in the following section, on fourteen different datasets listed in Table 1, with
                                        their properties shown in Table 2.

Table 1. The different datasets used in our experiments (Section 4.1).

    FB-SC        Facebook Social Circles network [29]: consists of anonymized friends lists from Facebook.
    FB-PP        Page-Page graph of verified Facebook pages [29]: nodes represent official Facebook pages, while links are mutual likes between pages.
    email        Anonymized network generated from email data of a large European research institution, modeling the incoming and outgoing email exchange between its members [29].
    STD          A database network of the Computer Science department of the University of Antwerp, representing the connections between students, professors, and courses [30].
    PPI          A subnetwork of the BioGRID Interaction Database [31], using the PPI network for Homo sapiens.
    lesmis       A network depicting the coappearance of characters in the novel Les Miserables [32].
    netscience   A coauthorship network of scientists working on network theory and experiments [29].
    polbooks     Network of books about US politics, with edges between books representing frequent copurchasing of books by the same buyers (http://www-personal.umich.edu/~mejn/netdata/, accessed on 20 October 2021).
    CondMat      Collaboration network of Arxiv Condensed Matter Physics [33].
    GrQc         Collaboration network of Arxiv General Relativity [33].
    HepTh        Collaboration network of Arxiv Theoretical High Energy Physics [33].
    CM03         Collaboration network of Arxiv Condensed Matter till 2003 [33].
    CM05         Collaboration network of Arxiv Condensed Matter till 2005 [33].
    PubMed       Collaboration network extracted from the PubMed database (analyzed in Appendix A), containing 2122 nodes, with a ground truth of 31 ambiguous nodes (6 of which map to more than 2 entities) and 1 duplicate node [1].

Table 2. Various properties of each semi-synthetic network used in our experiments.

                 fb-sc        fb-pp        email        lesmis       polbooks     STD
    # Nodes      4039         22,470       986          77           105          395
    # Edges      88,234       170,823      16,064       254          441          3423
    Avg degree   43.7         15.2         32.6         6.6          8.4          17.3
    Density      1.1 × 10⁻²   6.8 × 10⁻⁴   3.3 × 10⁻²   8.7 × 10⁻²   8.1 × 10⁻²   4.4 × 10⁻²

                 ppi          netscience   GrQc         CondMat      HepTh        CM05         CM03
    # Nodes      3852         379          4158         21,363       8638         36,458       27,519
    # Edges      37,841       914          13,422       91,286       24,806       171,735      116,181
    Avg degree   19.6         4.8          6.5          8.5          5.7          9.4          8.4
    Density      5.1 × 10⁻³   1.3 × 10⁻²   1.6 × 10⁻³   4.0 × 10⁻⁴   6.6 × 10⁻⁴   2.6 × 10⁻⁴   3.1 × 10⁻⁴

                            4.2.2. Quantitative Evaluation of Node Identification
     In this section, we focus on answering Q1: given a contracted graph, FONDUE-NDA aims to identify the list of contracted (ambiguous) nodes present in it. We first discuss the competing baselines in the following section.

Baselines. As mentioned in Section 1, most entity disambiguation methods in the literature focus on the task of re-assigning the edges of an already predefined set of ambiguous nodes, while the process of identifying these nodes in a given non-attributed network is usually overlooked. Thus, very few approaches tackle the identification task. In this section, we compare FONDUE-NDA with three competing approaches that focus on this task: one existing method and two heuristics.
     Normalized-Cut (NC) The work of [16] comes closest to ours, as their method also aims to identify ambiguous nodes in a given graph, by using Markov clustering to cluster the ego network of a vertex u with the vertex itself removed. NC favors groupings with few cross-edges between the different clusters of u's neighbors. The result is a score reflecting the quality of the clustering, using the normalized cut (NC):

$$NC = \sum_{i=1}^{k} \frac{W(C_i, \bar{C}_i)}{W(C_i, C_i) + W(C_i, \bar{C}_i)},$$

with $W(C_i, C_i)$ the sum of all edges within cluster $C_i$, $W(C_i, \bar{C}_i)$ the sum of all edges between cluster $C_i$ and the rest of the network $\bar{C}_i$, and k the number of clusters in the graph. Although [17] also worked on identifying nodes based on topological features, their method (which is not publicly available) performed worse than [16] in all cases, so we only chose the latter as a competing baseline.
     Connected-Component Score (CC) We also include another baseline, the connected-component score (CC), relying on the same approach as [16] with a slight modification: instead of computing the normalized cut score based on the clusters of a node's ego graph, we count the number of connected components of the node's ego graph, with the node itself removed.
     Degree Finally, we use node degree as a baseline. As contracted nodes usually have a higher degree, inheriting edges from the combined nodes, degree is a simple predictor of node ambiguity.
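The CC and NC baselines can be sketched on a dict-of-sets graph representation. This is our minimal reimplementation for illustration, not the code of [16]; the Markov clustering step itself is omitted, so `nc_score` takes a precomputed clustering:

```python
def ego_minus(adj, u):
    """u's ego network with u itself removed (dict-of-sets undirected graph)."""
    keep = adj[u]
    return {v: adj[v] & keep for v in keep}

def cc_score(adj, u):
    """Connected-component baseline: number of connected components of u's
    ego network after removing u (more components => more likely ambiguous)."""
    ego, seen, comps = ego_minus(adj, u), set(), 0
    for start in ego:                      # depth-first search per component
        if start in seen:
            continue
        comps, stack = comps + 1, [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(ego[v] - seen)
    return comps

def nc_score(adj, clusters):
    """Normalized cut of a clustering {C_1, ..., C_k}: for each cluster,
    cross-edges divided by (within-edges + cross-edges), unweighted."""
    total = 0.0
    for C in clusters:
        within = sum(len(adj[v] & C) for v in C) / 2   # W(C, C)
        cross = sum(len(adj[v] - C) for v in C)        # W(C, C-bar)
        if within + cross:
            total += cross / (within + cross)
    return total
```

The degree baseline is then simply `len(adj[u])`.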

Evaluation Metric. FONDUE-NDA ranks nodes according to their calculated ambiguity score (how likely the node is to be ambiguous); the same goes for NC and CC. At first glance, the evaluation can be approached from a binary classification perspective, by considering the top X ranked nodes as ambiguous (where X is the actual number of true positives), and thus using the usual binary classification metrics, such as F1-score, precision, recall, and AUC. However, this requires knowing the number of true positives beforehand, i.e., the number of actually ambiguous nodes (or setting a clear cutoff value), which is only possible in labeled datasets and controlled experiments. In real-world settings, if FONDUE-NDA is to be used to detect ambiguous nodes in unlabeled networks, this is too restrictive: what matters is that relevant (ambiguous) nodes are ranked more highly than non-relevant nodes. It is therefore necessary to extend the traditional binary classification metrics, which are based on binary relevance judgments, to more flexible graded relevance judgments, such as cumulative gain, a form of graded precision (identical to precision when the rating scale is binary). However, as our datasets are by nature highly imbalanced, mainly because ambiguous nodes are by definition a small part of the network, a better take on the cumulative gain metric is needed. Hence, we employ the normalized discounted cumulative gain to evaluate our method, alongside the traditional binary classification metrics listed above. Below, we detail each metric.

     Precision The number of correctly identified positive results divided by the number of all predicted positive results:

$$\text{Precision} = \frac{TP}{TP + FP}$$

     Recall The number of correctly identified positive results divided by the number of all positive samples:

$$\text{Recall} = \frac{TP}{TP + FN}$$

     F1-score The harmonic mean of precision and recall; an F1 score reaches its best value at 1 and its worst at 0:

$$F1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

     Note that, because with this cutoff the number of false positives equals the number of false negatives, the values of recall, precision, and F1-score coincide.
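As a concrete illustration of this cutoff-based evaluation (a sketch; the function name and the score/label representation are ours, not from the FONDUE code):

```python
def topk_precision_recall_f1(scores, labels):
    """Rank nodes by ambiguity score, label the top X as positive, where X is
    the true number of ambiguous nodes, and compute precision/recall/F1.
    With this cutoff, #FP == #FN, so the three metrics coincide."""
    x = sum(labels)                      # number of truly ambiguous nodes
    if x == 0:
        return 0.0, 0.0, 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = sum(1 for i in order[:x] if labels[i])
    precision = tp / x                   # FP = X - TP
    recall = tp / x                      # FN = X - TP as well
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```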
     Area Under the ROC Curve (AUC) A ROC curve is a two-dimensional depiction of classifier performance, which can be reduced to a single scalar by computing the area under the curve (AUC). Essentially, the AUC is the probability that our measure ranks a randomly chosen ambiguous node (positive example) higher than a randomly chosen non-ambiguous node (negative example). Ideally, this probability is 1, meaning our method correctly ranks ambiguous nodes 100% of the time; the baseline value is 0.5, where ambiguous and non-ambiguous nodes are indistinguishable. This accuracy measure has been used in other works in this field, including [16], which makes comparison with their work easier.
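This probabilistic interpretation maps directly to a pairwise computation, sketched here in the rank-statistic form of the AUC (ties count as 0.5; the function name is ours):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen ambiguous (positive) node
    is scored above a randomly chosen non-ambiguous (negative) node."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Compare every positive/negative pair; ties contribute half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```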
     Discounted Cumulative Gain (DCG) The main limitation of the previous metrics, as discussed earlier, is their inability to account for graded scores beyond binary classification. To address this, we use cumulative-gain-based metrics. Given a ranked result list, the cumulative gain (CG) is the sum of the graded relevance values of all results:

$$CG = \sum_{i=1}^{n} \text{relevance}_i$$

     DCG [34], on the other hand, takes position into account and adds a penalty when a highly relevant document appears lower in the result list: the graded relevance value is reduced logarithmically, proportionally to the position of the result. In practice, it is the sum of the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount; the higher the value, the better the ranking:

$$DCG = \sum_{i=1}^{n} \frac{\text{relevance}_i}{\log_2(i+1)}$$

     Normalized Discounted Cumulative Gain (NDCG) NDCG is commonly used in information retrieval to measure the effectiveness of search algorithms, where highly relevant documents are more useful when appearing earlier in the result list, and more useful than marginally relevant documents, which are in turn better than non-relevant documents. It improves upon DCG by accounting for the variation of relevance and providing proper upper and lower bounds, so that scores can be averaged across all relevance scores. It is computed by summing the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount, and then dividing by the best possible score, the ideal DCG (IDCG, obtained for a perfect ranking), to obtain a score between 0 and 1:

$$NDCG = \frac{DCG}{IDCG}$$
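The two steps above (discounted sum, then normalization by the ideal ranking) can be sketched as follows (function names are ours):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a relevance list, already in ranked order.
    Position i (0-based) is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(true_relevance, predicted_scores):
    """Rank the true relevance values by the predicted scores, apply the
    logarithmic discount, and normalize by the ideal DCG (perfect ranking)."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    idcg = dcg(sorted(true_relevance, reverse=True))
    return dcg(ranked) / idcg if idcg > 0 else 0.0
```

A perfect ranking yields an NDCG of exactly 1; any misordering yields a value strictly between 0 and 1.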

Evaluation pipeline. We first perform network contraction on the original graph, fixing the ratio of ambiguous nodes to r. We then embed the network using CNE and compute the disambiguation measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. The scores yielded by the measures are then compared to the ground truth (i.e., binary labels indicating whether a node is a contracted node). This is done for three values of r ∈ {0.001, 0.01, 0.1}. We repeat the process 10 times, using a different random seed to generate the contracted network, and average the scores. For the embedding configuration, we set the CNE parameters to σ1 = 1, σ2 = 2, with the dimensionality limited to d = 8.

Results. The results are illustrated in Figure 3 and shown in detail in Table 3; we focus mainly on NDCG, as it is the better measure for assessing the ranking performance of each method. FONDUE-NDA outperforms the state-of-the-art method, as well as the non-trivial baselines, in terms of NDCG on most datasets. It is also more robust to variation in the size of the network and the fraction of ambiguous nodes in the graph. NC seems to struggle to identify ambiguous nodes in smaller networks (Table 2). Additionally, as we tested multiple network settings, with randomly uniform contraction (randomly selecting a node-pair and merging them) or conditional contraction (selecting node-pairs that do not share common neighbors, to realistically mimic collaboration networks), we did not observe any significant changes in the results.

Table 3. Performance evaluation (NDCG) on multiple datasets for our method compared with other baselines, for two different contraction methods. Note that for some datasets with a small number of nodes, we did not perform any contraction at r = 0.001, as the number of contracted nodes would be very small; the values for those cases are replaced by "−".

                                   Ambiguity Rate                    10%                                    1%                                   0.1%
                                   Method           FONDUE-NDA    NC      CC    Degree   FONDUE-NDA    NC      CC    Degree   FONDUE-NDA    NC      CC    Degree

                                   Randomly uniform contraction
                                     fb-sc             0.954    0.962   0.768   0.776       0.767    0.875   0.569   0.423       0.679    0.535   0.272   0.199
                                     fb-pp             0.899    0.825   0.821   0.804       0.649    0.532   0.528   0.511       0.374    0.268   0.268   0.253
                                     email             0.783    0.661   0.619   0.704       0.529    0.305   0.264   0.310         −        −       −       −
                                     student           0.778    0.664   0.568   0.652       0.396    0.328   0.235   0.257         −        −       −       −
                                     lesmis            0.906    0.570   0.499   0.622         −        −       −       −           −        −       −       −
                                     polbooks          0.972    0.604   0.534   0.698       1.000    0.310   0.267   0.318         −        −       −       −
                                     ppi               0.759    0.670   0.724   0.741       0.420    0.353   0.381   0.387       0.194    0.138   0.147   0.151
                                     netscience        0.886    0.784   0.731   0.721       0.508    0.378   0.323   0.288         −        −       −       −
                                     GrQc              0.857    0.805   0.796   0.768       0.603    0.447   0.437   0.415       0.249    0.195   0.184   0.168
                                     CondMat           0.864    0.855   0.843   0.816       0.601    0.553   0.543   0.520       0.367    0.278   0.269   0.255
                                     HepTh             0.860    0.798   0.823   0.796       0.582    0.466   0.494   0.470       0.325    0.201   0.224   0.208
                                     cm05              0.884    0.873   0.859   0.827       0.627    0.590   0.582   0.545       0.471    0.312   0.307   0.288
                                     cm03              0.888    0.869   0.852   0.823       0.635    0.577   0.562   0.534       0.335    0.297   0.281   0.272

                                   No common neighbors
                                     fb-sc             0.953    0.989   0.768   0.764       0.730    0.933   0.591   0.418       0.399    0.665   0.321   0.172
                                     fb-pp             0.895    0.826   0.820   0.801       0.650    0.532   0.529   0.510       0.389    0.266   0.267   0.253
                                     email             0.676    0.696   0.625   0.604       0.303    0.319   0.288   0.256         −        −       −       −
                                     student           0.659    0.726   0.531   0.587       0.368    0.447   0.201   0.229         −        −       −       −
                                     lesmis            0.755    0.591   0.498   0.486         −        −       −       −           −        −       −       −
                                     polbooks          0.981    0.620   0.544   0.696       1.000    0.268   0.420   0.363         −        −       −       −
                                     ppi               0.725    0.673   0.721   0.700       0.398    0.352   0.381   0.373       0.166    0.139   0.147   0.144
                                     netscience        0.877    0.797   0.714   0.705       0.622    0.372   0.304   0.290         −        −       −       −
                                     GrQc              0.861    0.806   0.794   0.766       0.580    0.445   0.435   0.416       0.280    0.197   0.183   0.173
                                     CondMat           0.863    0.855   0.843   0.815       0.585    0.554   0.542   0.516       0.317    0.274   0.273   0.257
                                     HepTh             0.856    0.798   0.824   0.796       0.581    0.467   0.494   0.480       0.285    0.204   0.212   0.213
                                     cm05              0.883    0.874   0.858   0.825       0.633    0.591   0.582   0.543       0.414    0.310   0.312   0.289
                                     cm03              0.884    0.869   0.853   0.822       0.651    0.577   0.561   0.533       0.439    0.295   0.279   0.271
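The NDCG scores in Table 3 reward rankings that place the truly ambiguous (contracted) nodes near the top. A minimal sketch of the metric with binary relevance follows; the exact gain and discount functions used in our evaluation are an assumption here:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG normalized by the ideal (descending-sorted) DCG."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A ranking that places the single ambiguous node third scores 0.5;
# a perfect ranking [1, 0, 0, 0] scores 1.0.
print(ndcg([0, 0, 1, 0]))  # → 0.5
```

A score of 1.000 (e.g., polbooks at the 1% rate) therefore means every contracted node was ranked above all unambiguous ones.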