Semantic Hypergraphs - arXiv.org
Semantic Hypergraphs

Telmo Menezes*1 and Camille Roth†1,2

1 Computational Social Science Team, Centre Marc Bloch Berlin (CNRS/HU), Friedrichstr. 191, 10117 Berlin, Germany
2 CAMS (Centre Analyse et Mathématique Sociales, UMR 8557 CNRS/EHESS), 54 Bd Raspail, 75007 Paris, France
* menezes@cmb.hu-berlin.de
† roth@cmb.hu-berlin.de

arXiv:1908.10784v2 [cs.IR] 18 Feb 2021

Abstract

Approaches to natural language processing (NLP) may be classified along a double dichotomy open/opaque – strict/adaptive. The former axis relates to the possibility of inspecting the underlying processing rules, the latter to the use of fixed or adaptive rules. We argue that many techniques fall into either the open-strict or opaque-adaptive categories. Our contribution takes steps in the open-adaptive direction, which we suggest is likely to provide key instruments for interdisciplinary research. The central idea of our approach is the Semantic Hypergraph (SH), a novel knowledge representation model that is intrinsically recursive and accommodates the natural hierarchical richness of natural language. The SH model is hybrid in two senses. First, it attempts to combine the strengths of machine learning (ML) and symbolic approaches. Second, it is a formal language representation that reduces but tolerates ambiguity and structural variability. We will see that SH enables simple yet powerful methods of pattern detection, and features a good compromise for intelligibility both for humans and machines. It also provides a semantically deep starting point (in terms of explicit meaning) for further algorithms to operate and collaborate on. We show how modern NLP ML-based building blocks can be used in combination with a random forest classifier and a simple search tree to parse natural language (NL) to SH, and that this parser can achieve high precision in a diversity of text categories. We define a pattern language representable in SH itself, and a process to discover knowledge inference rules. We then illustrate the efficiency of the SH framework in a variety of tasks, including conjunction decomposition, open information extraction, concept taxonomy inference and co-reference resolution, and an applied example of claim and conflict analysis in a news corpus.

Keywords: natural language understanding; knowledge representation; information extraction; inference systems; explainable artificial intelligence; hypergraphs

1 Introduction

Natural language processing (NLP) approaches generally belong to either one of two main strands, which also often appear to be mutually exclusive. On the one hand we essentially have symbolic methods and models which are open, in the sense that their internal mechanisms as well as their conclusions are easy to inspect and understand, but which deal with linguistic patterns in a relatively strict manner. On the other hand, we have adaptive models based on machine learning (ML) which are usually opaque to inspection and too complex for their reasoning to be intelligible, but which achieve increasingly impressive feats that suggest deeper understanding.

Presently, there is a strong research focus on the latter, and for good reason. Among adaptive models, deep neural networks, for one, managed to jointly learn and improve performance in classic NLP tasks such as part-of-speech tagging, chunking, named-entity recognition, and semantic role-labeling [as early as 19]. In other cases, modern ML enabled methods that did not exist before, e.g. estimation of semantic similarity using word embeddings [45]. More recently, Bidirectional Encoder Representations from Transformers (BERT) have shown that pre-trained general models can be fine-tuned to achieve state-of-the-art performance in specific language understanding tasks such as question answering and language inference [21]. Nonetheless, symbolic methods possess several specific and important features, namely that they can offer human-readable representations of knowledge, as well as language understanding through formal and inspectable rule-based logical inference.

Why do we observe this apparent trade-off between openness and adaptivity? Initial approaches to NLP were of a symbolic nature, based on rules written by hand, or on algorithms akin to the ones that are used for programming language interpreters and compilers, such as recursive descent parsers. It became apparent that the diversity of grammatical constructs that can be found in natural language is too large to be tackled in such a way. The problem is compounded by ungrammatical constructs that are nevertheless frequent in real-world language usage (e.g. simple mistakes, neologisms, slang). In other words, content in natural language is generated by actors that are much more complex, and also more error-prone and error-tolerant than conventional algorithms. ML is a natural fit for this type of problem and, as we mentioned, vastly surpasses the capabilities of human-created symbolic systems in a variety of tasks.

We suggest however that there is a “hidden relationship” between explicit symbolic manipulation rules and modern ML: the latter can be seen as a form of “automatic programming” through large-scale statistical learning processes that amount to the generation of highly complex programs through adaptive pressure instead of human programmers’ efforts. Whether it is gradient descent on a multi-layered network topology, or something more prosaic like entropy reduction in a decision tree, it is still program generation through adaptation. The capability of these methods to generate such complex programs is what allows them to tackle the complexities of NL, but it is also this very complexity that makes them opaque.

We can thus imagine a double dichotomy open/opaque – strict/adaptive. We argue that existing approaches generally fall into either the open-strict or opaque-adaptive categories. A few approaches have ventured into the open-adaptive domain [7, 40] and our contribution aims at significantly expanding this direction. Before discussing our approach, let us consider why open-adaptive is a desirable goal. The work we present here was performed in the context of a computational social science (CSS) research team, where NLP is a scientific instrument capable of assisting in the analysis of text corpora that are too vast for humans to study in detail. We argue that further progress in the study of socio-technical systems and their dynamics could be enabled by open-adaptive scientific instruments for language understanding.

In current CSS research, the more common approaches aim to transform natural language documents into structured data that can be more easily analyzed by scholars and are referred to by a variety of umbrella terms such as “text mining” [69], “automated text analysis” [29] or “text-as-data methods” [77]. They exhibit a wide range of sophistication, from simple numerical statistics to more elaborate ML algorithms. Some methods indeed rely essentially on scalar numbers, for instance by measuring text similarity (e.g., with cosine distance [64]) or attributing a valence to text, as in the case of ideological estimation [63] or sentiment analysis [54], which in practice may be used to appraise the emotional content of a text (anger, happiness, disagreement, etc.) or public sentiment towards political candidates in social media [76]. Similarly, political positions in documents may be inferred from so-called “Wordscores” [39], a popular method in political science that also relies on the summation of pre-computed scores for individual words, and has more refined elaborations, e.g. with Bayesian regularization [47]. Other methods preserve the level of words: such is the case with term and pattern extraction (i.e., discovering salient words through the use of helper measures like term frequency–inverse document frequency (TF-IDF) [60]), so-called “Named Entity Recognition” [50] (used to identify people, locations, organizations, and other entities mentioned in corpora, for example news corpora [22] or Twitter streams [57]) and ad-hoc uses of conventional computer science approaches such as regular expressions to identify chunks of text matching against a certain pattern (for example, extracting all p-values from a collection of scientific articles [17]). Another strand of approaches operates at the level of word sets, including those geared towards topic detection (such as co-word analysis [37], Latent Dirichlet Allocation (LDA) [12] and TextRank [44], used to extract the topics addressed in a text) or used for relationship extraction (aimed at deriving semantic relations between entities mentioned in a text, e.g., is(Berlin, City)) [4]. Recent advances in embedding techniques have also made it possible to describe topics extensionally as clusters of documents in some properly defined space [5, 33].

Overall, these techniques provide useful approaches to analyze text corpora at a high level, for example, with regard to their main entities, relationships, sentiment, and topics. However, there is limited support to detect, for instance, more sophisticated claim patterns across a large volume of texts, what recurring statements are made about actors or actions, and what are the qualitative relationships among actors and concepts. This type of goal, for example, extends semantic analysis to a socio-semantic framework [58] which also takes into account actors who make claims or who are the target of claims [22].

It is also particularly interesting to consider the model of knowledge representation that is implicitly or explicitly associated with the various NLP/text mining/information extraction approaches. To illustrate, on one extreme we can consider traditional knowledge bases and semantic graphs, which are open in our sense, but also limited in their expressiveness and depth. On the other, we have the extensive knowledge opaquely encoded in neural network models such as BERT or GPT-2/3 [e.g. 15]. Beyond the desirability of open knowledge bases for their own sake, we propose that a language representation that is convenient for both humans and machines can constitute a lingua franca, through which systems of cognitive agents of different natures can cooperate in a way that is understandable and inspectable. Such systems could be used to combine the strengths of symbolic and statistical inference.

The central idea of our approach is the Semantic Hypergraph (SH), a novel knowledge representation model that is intrinsically recursive and accommodates the natural hierarchical richness of NL. The SH model is hybrid
in two senses. First, it attempts to combine the strengths of ML and symbolic approaches. Second, it is a formal language representation that reduces but tolerates ambiguity, and that also reduces structural variability. We will see that SH enables simple methods of pattern detection to be more powerful and less brittle, that it is a good compromise for intelligibility both for humans and machines, and that it provides a semantically deeper starting point (in terms of explicit meaning) for further algorithms to operate and collaborate on.

In the next section we discuss the state of the art, comparing SH to a number of approaches from various fields and eras. We then describe the structure and syntax of SH, followed by an explanation of how modern and standard NLP ML-based building blocks provided by an open source software library [31] can be used in combination with a random forest classifier and a simple search tree to parse NL to SH. Here we also provide precision benchmarks of our current parser, which is then employed in the experiments that follow. We attempted to perform a set of experiments of a rather diverse nature, to gather evidence of SH usefulness in a variety of roles, and of its potential to tackle the challenge that we started by stating in this introduction, and to gather empirical insights. One important language understanding task is information extraction from text. One formulation of such a task that attracts significant attention is that of Open Information Extraction (OIE), the domain-free extraction from text of tuples (typically triplets) representing semantic relationships [24]. We will show that a small and simple set of SH patterns can produce competitive results in an OIE benchmark, when pitted against more complex and specialized systems in that domain. We will demonstrate concept taxonomy inference and co-reference resolution, followed by claim and conflict identification in a database of news headers. We will show how SH can be used to generate semantically rich visual summaries of text.

2 Related Work

Knowledge bases. As a knowledge representation formalism, it is interesting to compare SH with traditional approaches. Let us start with triplet-based ones. For example, the Semantic Web [10, 62] community tends to use standards such as RDFa [1], which represent knowledge as subject-predicate-object expressions, and are conceptually equivalent to semantic graphs [3, 66] (similarly, a particular type of hypergraph has been used in [16] to represent tagged resources by users, yet this also reduces to fixed triplet conceptualization). Despite their usefulness for simple cases, such approaches cannot hope to match the semantic sophistication of what can be conveyed with open text. Binary relationships and lack of recursion limit the expressive power of semantic graphs, and we will see how SHs can represent semantic information that is lost in the graphic representation, for example the ability to express n-ary relationships, propositions about propositions and constructive definitions of concepts.

A further type of approach relying on knowledge bases is epitomized by the famous Cyc [36] project, a multi-decade enterprise to build a general-purpose and comprehensive system of concepts and rules. It is an impressive effort, nevertheless hindered by the limitations that we alluded to in the previous section concerning the ambiguity and diversity of semantic structures contained in NL, given that it relies purely on symbolic reasoning. Cyc belongs to a category of systems that are mostly concerned with question answering, a different aim than the one of the work that we propose here, which is more concerned with aiding in the analysis and summarization of large corpora of text for research purposes, especially in the social sciences, while not requiring full disambiguation of meaning nor perfect reasoning or understanding.

Several other notable knowledge bases of a similar semantic graph nature have been developed, some relying on collaborative human efforts to gather ground assertions, for example MIT’s ConceptNet [68], ATOMIC [61] from the Allen Institute, or very rigorous scholarly efforts of annotation, as is the case with WordNet [46] and its multiple variants, or relying on wiki-like platforms such as WikiData [75], or mining relationships from Wikipedia proper, as is the case with DBPedia [6], and more recently a transformer language model has been proposed to automatically extend common-sense knowledge bases [14]. We envision that such general-knowledge bases could be fruitfully integrated with SHs for various purposes, but such endeavours are beyond the scope of this work. We are instead interested in demonstrating what can be achieved by going beyond such non-hypergraphic approaches.

Hypergraphic approaches to knowledge representation. Hypergraphs have been proposed already in the 1970s as a general solution for knowledge representation [13]. More recently, Ben Goertzel produced similar insights [28], and in fact included a hypergraphic database called AtomSpace as the core knowledge representation of his OpenCog framework [30], an attempt to make Artificial General Intelligence emerge from the interaction of a collection of heterogeneous systems. As is the case with Cyc, the goals of OpenCog are however quite distinct from the aim of our work.

A model that shares similarities with ours but purely aims at solving a meaning matching problem is that of Abstract Meaning Representation (AMR) [7]. AMR is based on PropBank verbal propositions and their arguments [53], ensuring that all such meaning structures can be represented. SH completeness is based instead on Universal Dependencies [52], ensuring instead
that all cataloged grammatical constructs can be represented. AMR’s goal is to purely abstract meaning, while SH accommodates the ambiguity of the original NL utterances, bringing several important benefits: it makes their computational processing tractable in further ways, tolerates mistakes better and preserves communication nuance that would otherwise be lost. Furthermore, it remains open to structures that may not be currently envisioned. Parsing NL to AMR is a particularly hard task and, to our knowledge, there is currently no parser that approaches the capabilities of what we will demonstrate in this work. In part, this is a practical problem: we will see how we can take advantage of intermediary NLP tasks that are well studied and developed to achieve an NL to SH parser. Doing the same for AMR requires the construction of training data by extensive annotation efforts by humans. It could be argued that this is still a preferable goal, no matter how distant, given that AMR removes all ambiguity from statements. Here we point out that this aspect of AMR is also a downside, firstly because it makes all failures of understanding catastrophic (we will see how this is not the case for SH), and secondly because NL is inherently ambiguous. It is often the case that even human beings cannot fully resolve ambiguities, or that an ambiguous statement gains importance later on, with more information. We aim to define SH as a lingua franca for the collaboration of human and algorithmic actors of several natures, a less rigid goal than the one embodied by AMR.

Free text parsing. A classical NLP task is that of making explicit the grammatical structure of a sentence in the form of a parse tree. A particularly common type of such a tree in current use is the Dependency Parse Tree (DPT), based on dependency grammars. We will see that our own parser takes advantage of DPTs (among other high-level grammatical/linguistic features) as intermediary steps, but it is also interesting to notice that DPTs themselves can be considered as a type of hypergraphic representation of language [56]. In fact, as we will discuss below, they are already employed in various targeted language understanding tasks in a CSS context.

From the perspective of hypergraphic representation of language, the fundamental difference between DPTs and SHs is that the former aim at expressing the grammatical structure of language, while the latter aim at expressing its semantic structure, in the simplest possible way that enables meaning extraction in a principled and predictable way. In contrast to the ad-hoc nature of information extraction from DPTs, we will see that SHs structure NL in a way akin to functional computer languages, and allow, for example, for a generic methodology of extracting patterns. The expressive power of such patterns will be demonstrated in several ways, namely by demonstrating competitive results in a standard Open Information Extraction task. We will see that the type system of SHs (relying on 8 types) is much simpler than the diversity of grammatical roles contained in a typical set of dependency labels (such as Universal Dependencies), and we will also provide empirical evidence that SHs are not isomorphic to DPTs.

In the realm of OIE, one approach in particular with which our work shares some similarities is that of learning open pattern templates [40]. These pattern templates combine at the same symbolic level dependency parse labels and structure, part-of-speech tags, explicit lexical constraints and higher-order inferences (e.g. that some term refers to a person), to achieve sophisticated language understanding in the extraction of OIE tuples, being able to extract relations that are not only of a verbal nature, and demonstrating sensitivity to context. The work we will present does not attempt to directly combine diverse linguistic features at the service of a specific language understanding task. Instead, we propose to use such features to aid in the translation of NL into a structured representation, which relies by comparison on a very simple and uniform type system, and from which complex NL understanding tasks become easier, and that is of general applicability to a diversity of such tasks, while remaining fully readable and understandable by humans. Furthermore, it defines a system of knowledge representation in itself, that is directly focused on meaning instead of grammar.

Text mining. We have already covered in the previous section the most commonly used text mining approaches, while emphasizing the relative lack of sophistication in understanding text meaning. The need for such sophistication is all the more pressing for social sciences. On the one hand, qualitative social science methods of text analysis do not scale to the enormous datasets that are now available. On the other hand, quantitative approaches allow for other types of analysis that are enriching and complementary to qualitative research, yet may simplify extensively the processing in such a way that it hinders their adoption by scholars used to the refinement of qualitative approaches. And the more sophisticated the NLP techniques become, the further they tend to be from being used for large-scale text analysis purposes. Indeed, these systems are fast and accurate enough to form a starting point for more advanced computer-supported analysis in a CSS context, and they enable approaches that are substantially more sophisticated than the text mining state of the art discussed above. Yet, the results of such systems may seem relatively simplistic compared to human-level understanding of natural language.

The literature already features some works which attempt to go beyond language models based on word distributions (such as bags of words, co-occurrence clusters, or so-called “topics”) or triplets. For instance, Statement Map [49] is aimed at mining the various viewpoints
expressed around a topic of interest in the web. Here a notion of claim is employed. A statement provided by the user is compared against statements from a corpus of text extracted from various web sources. Text alignment techniques are used to match statements that are likely to refer to the same issue. A machine learning model trained over NLP-annotated chunks of text classifies pairs of claims as “agreement”, “conflict”, “confinement” and “evidence”. More broadly, the subfield of argumentation mining [38] also makes extensive use of machine learning and statistical methods to extract portions of text corresponding to claims, arguments and premises. These approaches generally rely on surface linguistic features; there is, however, an increasing trend of dealing with structured and relational data. Already in 2008, [73] proposed a system to extract binary semantic relationships from Dutch newspaper articles. A recent work [59] presents a system aimed at analysing claims in the context of climate negotiations. It leverages dependency parse trees and general ontologies [70] to extract tuples of the form:

    〈actor, predicate, negotiation_point〉

where the actors are stakeholders (e.g., countries), the predicates express agreement, opposition or neutrality and the negotiation point is identified by a chunk of text. Similarly, in another recent work [74], parse trees are used to automatically extract source-subject-predicate clauses in the context of news reporting over the 2008-2009 Gaza war, and used to show differences in citation and framing patterns between U.S. and Chinese sources.

These works help demonstrate the feasibility of using parse trees and other modern NLP techniques to identify viewpoints and extract more structured claims from text. While they are a step forward from pure bag-of-words analysis, they still leave out a considerable amount of information contained in natural language texts, namely by relying on topic detection, or on pre-defined categories, or on working purely on source-subject-predicate clauses. We propose to introduce a more sophisticated language model, where all entities participating in a statement are identified, where entities can be described as combinations of other entities, and where statements can be entities themselves, allowing for claims about claims, or even claims about claims about claims. The formal backbone of this model consists of an extended type of hypergraph that is both recursive and directed, thus generalizing semantic graphs and inducing powerful representation capabilities.

3 Semantic hypergraphs – structure and syntax

3.1 Structure

The SH model is essentially a recursive, ordered hypergraph that makes the structure contained in natural language (NL) explicit. On one hand, NL is recursive, allowing for concepts constructed from other concepts as well as statements about statements, and on the other hand, it can express n-ary relationships. We will see how a hypergraphic formalism provides a satisfactory structure for NL constructs.

While a graph $G = (V, E)$ is based on a vertex set $V$ and an edge set $E \subset V \times V$ describing dyadic connections, a hypergraph [8, 9] generalizes such structure by allowing n-ary connections. In other words, it can be defined as $H = (V, E)$, where $V$ is again a vertex set yet $E$ is a set of hyperedges $(e_i)_{i \in 1..M}$ connecting an arbitrary number of vertices. Formally, $e_i = \{v_1, \dots, v_n\} \in E = \mathcal{P}(V)$. We further generalize hypergraphs in two ways: hyperedges may be ordered [23] and recursive [32]. Ordering entails that the position in which a vertex participates in the hyperedge is relevant (as is the case with directed graphs). Recursivity means that hyperedges can participate as vertices in other hyperedges. The corresponding hypergraph may be defined as $H = (V, E)$ where $E \subset \mathcal{E}_V$, the recursive set of all possible hyperedges generated by $V$:

$$\mathcal{E}_V = \left\{ (e_i)_{i \in \{1..n\}} \;\middle|\; n \in \mathbb{N},\ \forall i \in \{1..n\},\ e_i \in V \cup \mathcal{E}_V \right\}.$$

In this sense, $V$ configures a set of irreducible hyperedges of size one, i.e., atomic hyperedges which we also denote as atoms, similarly to semantic graphs. From here on, we simply call these recursive ordered hyperedges “hyperedges”, or just “edges”, and we denote the corresponding hypergraph as a “semantic hypergraph”.

Let us consider a simple example, based on a set $V$ made of four atoms: the noun “(berlin)”, the verb “(is)”, the adverb “(very)” and the adjective “(nice)”. They may act as building blocks for both hyperedges “(is berlin nice)” and “(very nice)”. These structures can further be nested: the hyperedge “(is berlin (very nice))” represents the sentence “Berlin is very nice”. It illustrates a basic form of recursivity.

3.2 Syntax

In a general sense, the hyperedge is the fundamental unifying construct that carries information within the SH formalism. We further introduce the notion of hyperedge types, which simply describe the type of construct that some hyperedge represents: for instance, concepts, predicates or relationships, as in the above examples: respectively (berlin), (is) and (is berlin nice). We extensively detail hyperedge types and their role in the next subsections. For now, it is enough to know that predicates, in particular and for instance, belong to a larger family of types that are crucial for the construction of hyperedges and that we call connectors. In this regard, semantic hypergraphs rely on a syntactic rule that is both simple and universal: the first element in a non-atomic hyperedge must be a connector.

In effect, a hyperedge represents information by combining other (inner) hyperedges that represent information. The purpose of the connector is to specify in which sense inner hyperedges are connected. Naturally, it can
be followed by one or more hyperedges which play the role of arguments with respect to the connector. As hyperedges, if they are not atoms, they must also start with a connector themselves, in a recursive fashion.

We illustrate this on the hyperedge (is berlin (very nice)): here, (is) is a predicate playing the role of connector while (berlin) and (very nice) are arguments of the initial hyperedge. (berlin) is an atomic hyperedge, while (very nice) is a hyperedge made of two elements: the connector, (very), an atomic hyperedge, and an argument, (nice), also an atomic hyperedge. Neither of these two atoms can be decomposed further.

Readers who are familiar with Lisp will likely have noticed that hyperedges are isomorphic to S-expressions [42]. This is not purely accidental. Lisp is very close to λ-calculus, a formal and minimalist model of computation based on function abstraction and application. The first item of an S-expression specifies a function, the following ones its arguments. One can think of a function as an association between objects. Although hyperedges do not specify computations, connectors are similar to functions at a very abstract level, in that they define associations. The concepts of “race to space” and “race in space” are both associated with the concepts “race” and “space”, but the combination of these two concepts yields different meaning by application of either the connector “in” or “to”. For this reason, λ-calculus has also been applied to dependency parse trees in the realm of question-answering systems [56].

3.3 Types

We now describe a type system that further clarifies the role each entity plays in a hyperedge. In all, we distinguish 8 types, the smallest set we could find that appears to cover virtually all possible information representation roles cataloged in Universal Dependencies. We first present the types that atoms may have and discuss their use in constructing higher-order entities. We then show how hyperedge types are recursively inferable from the types of the connector and subsequent arguments.

Atomic concepts. The first, simplest and most fundamental role that atoms can play is that of a concept. This corresponds to concepts that can be expressed as a single word in the target language, for example “apple”; they are labeled by this human-readable string, as could be guessed from the previous subsection.

This defines an eponymous type, “concept”. The nomenclature we propose further indicates the type of an atom by appending a more machine-oriented code after this label and a slash (/). For concepts, this code is “C”:

    (apple/C)

As we shall see, these machine-oriented codes remove ambiguity and facilitate automatic inference and computations. The full list of types as well as their codes and purposes can be seen in Table 1.

Connectors. The second and last role that atoms can play is the role of connector. We then have five types of connectors, each one with a specific function that relates to the construction of specific types of hyperedges.

The most straightforward connector is the predicate, whose code is “P”. It is used to define relations, which are frequently statements. Let us revisit a previous example with types:

    (is/P berlin/C nice/C)

The predicate (is/P) both establishes that this hyperedge is a relation between the entities following it, and gives meaning to the relation. This is isomorphic to typical knowledge graphs [3, 66] where (berlin) and (nice) would be connected by an edge labeled with (is).

The modifier type (“M”) applies to one (and only one) existing hyperedge and defines a new hyperedge of the same type. In practice, as the name indicates, it modifies things and can be applied to concepts, predicates or other modifiers, and also to triggers, a type that we will subsequently address. For concepts, a typical case is adjectivation, e.g.:

    (nice/M shoes/C)

Note here that “nice” is being considered as a modifier, while “nice” was a concept in the previous case: this is due to the fact that (nice/M) and (nice/C) refer to two distinct atoms which share the same human-readable label, “nice”. To illustrate modification of predicates, let us revisit a previous example, but suppose that we declare that Berlin is not nice. Then we can apply a modifier to the predicate, such as (not/M), so that:

    ((not/M is/P) berlin/C nice/C)

Finally, modifiers may modify other modifiers:

    ((very/M nice/M) shoes/C)

The builder type (“B”) combines several concepts to create a new one. For example, atomic concepts (capital/C) and (germany/C) can be combined with the builder atom (of/B) to produce the concept of “capital of Germany”:

    (of/B capital/C germany/C)

A very common structure in English and many other languages is that of the compound noun, e.g., “guitar player” or “Barack Obama”. To represent these cases, we introduce a special builder atom that we call (+/B). Unlike what we have seen so far, this is an atom that does not correspond to any word, but indicates that a concept is
formed by the compound of its arguments; it is necessary to render such compound structures. The previous examples can be represented respectively as (+/B guitar/C player/C) and (+/B barack/C obama/C).

Code  Type         Purpose                                             Example                      Atom  Non-atom
C     concept      Define atomic concepts                              apple/C                      ×     ×
P     predicate    Build relations                                     (is/P berlin/C nice/C)       ×     ×
M     modifier     Modify a concept, predicate, modifier, trigger      (red/M shoes/C)              ×     ×
B     builder      Build concepts from concepts                        (of/B capital/C germany/C)   ×     -
T     trigger      Build specifications                                (in/T 1994/C)                ×     -
J     conjunction  Define sequences of concepts or relations           (and/J meat/C potatoes/C)    ×     -
R     relation     Express facts, statements, questions, orders,...    (is/P berlin/C nice/C)       -     ×
S     specifier    Relation specification (e.g. condition, time,...)   (in/T 1976/C)                -     ×

Table 1: Hyperedge types with use purposes and examples. Connector types (P, M, B, T and J) are emphasized with a gray background in the original table. The rightmost columns specify whether this type may be encountered in atomic or non-atomic hyperedges.

Conjunctions (“J”), like the English grammatical construct of the same name, join or coordinate concepts or relations:

    (and/J meat/C potatoes/C)
    (but/J (likes/P mary/C meat/C) (hates/P potatoes/C))

We also introduce a special conjunction symbol, (:/J), to denote implicit sequences of related concepts. For example, the phrase: “Freud, the famous psychiatrist”, would be represented as:

    (:/J freud/C (the/M (famous/M psychiatrist/C)))

The remaining case, triggers (T), concerns additional specifications of a relationship, for example conditional (“We go if it rains.”), or temporal (“John and Mary traveled to the North Pole in 2015”), local (“Pablo opened a bar in Spain”), etc.:

    (opened/P pablo/C (a/M bar/C) (in/T spain/C))

Hyperedge type inference. Atomic types are entirely covered by these six types, of which three exclusively concern atoms (builders, triggers and conjunctions). We already hinted at the fact that non-atomic hyperedges also have types. These are implicit and inferable from the types of the connector and its arguments. Given, for example, that (germany/C) is an atom of type concept (C), the hyperedge (of/B capital/C germany/C) is also a concept, and this can be inferred from the fact that its connector is of type builder (B). Builders need to be followed by at least two concepts. Modifiers (M) only accept one argument, and the hyperedge in which they participate has the type of the single argument of the modifier. For example, the hyperedge (northern/M germany/C) is a concept (C), and (not/M is/P) is a predicate (P).

Table 2 lists all type inference rules and their respective requirements. They also induce syntactic constraints which close the SH type system.

We may now introduce the two last types of our type system, relation (R) and specifier (S), which only concern non-atomic hyperedges: they are always defined as the result of a composition of hyperedges. Relations are typically used to state some fact (even though they can also be used to represent questions, orders and other things). (is/P Berlin/C nice/C) is an obvious example of relation. In our context, they thus turn out to be a crucial hyperedge type. Specifiers are types that play a more peripheral role, in the proper sense, in that they are supplemental to relations. Specifiers are produced by triggers. For example, the trigger “(in/T)” can be used to construct the specification: (in/T 1976/C). Specifications, as the name implies, add precisions to relations e.g., when, where, why or in which case something happened.

3.4 Argument roles

We introduce a last notion that we employ to make meaning more explicit: argument roles for builders and predicates. They are represented as sequences of characters that indicate the role of the respective arguments following such connectors.

Concept builders. Given a concept hyperedge, a key issue is that of inferring its main concept, i.e. the concept that can be assumed to be its hypernym. Beyond the simple case of atoms, concept hyperedges may only be formed by connectors that are either modifiers or
builders. When the connector is a modifier, finding the hypernym is admittedly trivial. When the connector is a builder, it is often possible to infer the main concept among the arguments. There are only two possible roles: “main” (denoted by m) and “auxiliary” (denoted by a). For example:

    (+/B.am tennis/C ball/C)

The argument role annotation “.am” indicates that ball/C is the main concept in the construct, meaning that (+/B.am tennis/C ball/C) is a sort of ball/C: the main concept is a hypernym of the whole construct.

Element types    →   Resulting type
(M x)                x
(B C C+)             C
(T [CR])             S
(P [CRS]+)           R
(J x y’+)            x

Table 2: Type inference rules. We adopt the notation of regular expressions: the symbol + is used to denote one or more entities with the type that precedes it, while square brackets indicate several possibilities (for instance, [CR]+ means “at least one of any of both C or R” types). x means any type: (M x) is of type x.

With compound nouns ((+/B) builder), we simply make use of part-of-speech and dependency labels to infer the main concept. Another common situation where finding roles is quite trivial is the case of builders derived from a preposition, such as (of/B), which express a relationship between the arguments. For example, in (of/B.ma capital/C germany/C), the main concept is (capital/C). “Capital of Germany” is thus a type of capital. In English and many other languages, it is always the case that the main concept is the first argument after a builder derived from a preposition.

Predicates. Predicates can induce specific roles that the following arguments play in a relation. The need for argument roles in relations arises from cases where the role cannot be inferred from the type of the argument. For example, the same concept could participate in a relation as a subject or as an object. Consider for instance the sentence “John gave Mary a flower”, represented as:

    (gave/P.sio john/C mary/C (a/M flower/C))

In this relation, the argument role string “sio” indicates that the three arguments following the predicate respectively play the roles of subject, indirect object and direct object. This relation involves three concepts united by the predicate that represents the act of giving, but without the argument roles, who the giver is, who the receiver is, and what object is being given would remain undefined. Relying on ordering would not be enough, both due to the flexibility of NL in this regard, and to the fact that the presence of a certain role after a predicate is often optional.

There are admittedly more possible roles than for builders. They are shown in table 3. Once again, this set is the result of an effort to cover all grammatical cases listed in the Universal Dependencies in the most succinct way possible. Most of them (in fact, the first 8 in the table) directly correspond to generic grammatical roles of the same name. Of these, the first 6 are by far the most frequent. Specifications were already discussed in the previous subsection (3.3), and their purpose as hyperedges coincides with their role when participating in relations: as an additional specification to the relation (temporal, conditional, etc.). Finally, a relative relation is a nested relation that acts as a building block of the outer relation that contains it. We will make extensive use of this later, to identify what is being claimed by a given actor.

Role                 Code
active subject       s
passive subject      p
agent (passive)      a
subject complement   c
direct object        o
indirect object      i
parataxis            t
interjection         j
specification        x
relative relation    r

Table 3: Predicate argument roles.

4 Translating NL into SH

We now discuss the crucial task of translating NL into this SH representation. This can, of course, be framed as a conventional supervised ML task. A difficulty arises from the lack of training data. SH is a novel representation, and the effort necessary to annotate a sufficiently large amount of text to train an NL to SH translator from scratch is far from trivial. We were motivated to look for an alternative, and we hypothesized that it would be much easier to infer the SH representation from grammatically-enriched representations than from raw text. We will show that this indeed appears to be the case.

We propose a two-stage approach. The first (α-stage) is a classifier that assigns a type to each token in a given sentence. The second (β-stage) is a search tree-based algorithm that recursively applies the rules in table 2 to impose the hypergraphic structure on the sequence of atoms produced by the α-stage. This restricts the ML
part of the process to the α-stage, making it a trivial classification problem.

4.1 α-stage

The classification categories correspond to the set of the six atomic types shown in table 1, with one additional category for tokens that should be discarded (typically punctuation). The open question is the feature set. We will see how, operating on the previous assumption regarding grammatical annotation, we use spaCy¹ – a popular NLP tool – to generate appropriate features.

¹ An open-source library for NLP in Python which includes convolutional neural network models for tagging, parsing and named entity recognition in multiple languages. A relatively recent comparison of ten popular syntactic parsers found spaCy to be the fastest, with an accuracy within 1% of the best one [18].

Using this library we perform segmentation of text into sentences, followed by tokenization and annotation of tokens with parts-of-speech, dependency labels and named entity categories. In short, we deploy the full arsenal of off-the-shelf NLP tasks that come available with spaCy. In this work we restrict ourselves to the English language and we use the “en_core_web_lg-2.0.0” language model.

We collected randomly selected texts in English from five categories: fiction (5 books, 87738 sentences) and non-fiction books (5 books, 51597 sentences), news (10 articles, 532 sentences), scientific articles (10 articles, 3467 sentences) and Wikipedia articles (10 articles, 2888 sentences). From these we selected 60 random sentences in each category, thus a total of 300 sentences representing 6936 tokens. An interactive computer script was used to aid in the process of manually annotating each word of these sentences with one of the alpha categories i.e., atomic types. These were used to train a random forest classifier. For this purpose we employed the one included with scikit-learn (version 0.23.2), a widely used ML package. We did not perform any hyperparameter tuning, and used the default parameters set by this version of the package. There is possibly room for improvement here. For the aims of this work, we found it preferable to avoid introducing potentially confounding factors that could arise from hyperparameter optimization efforts.

Feature definition. We consider an initial set that encompasses all the potentially useful features that we could derive from a standard NLP pipeline such as spaCy. As we mentioned, it provides dependency parse labels (referred to, from now on, as DEP) and named entity recognition categories (NER). Parts-of-speech are provided in two flavors: the more extensive OntoNotes tag set (version 5) from the Penn Treebank, and the simpler Universal Dependencies (UD) part-of-speech tag set (version 2). Accuracy values for each of these elements are reported in [67] to be 0.97 for the fine-grained part-of-speech tagger (i.e., guessing the OntoNotes tag), 0.92 for unlabeled dependencies (i.e., guessing the head of each token) and 0.90 for labeled dependencies (i.e., the head and the label).

Let us refer to the former as TAG, and to the latter as POS. We can also consider the most common words in the corpus. We consider as features the sets of 15, 25, 50 and 100 most common words (WORD15, WORD25 and so on). Further features indicate if a token corresponds to some punctuation symbol, if it is at the root of the dependency parse trees, if it has left or right children in this same tree, and finally its shape in terms of capitalization (e.g. the shape of the word “Alice” is Xxxxx). Then, we establish three types of relative tokens: the ones that appear directly after or before the current one in the sentence, if they exist, and the one that is the parent of the current one in the dependency parse tree, if it exists. For each one of these tokens, all the previous features are also applied (for example, the UD part-of-speech of the dependency head is HPOS, and the part-of-speech of the subsequent word in the sentence is POS_AFTER). We thus have 33 candidate features in total. All of these features are categorical, and we employ one-hot encoding to feed them to the decision trees.

Feature selection. We tested two approaches for feature selection: a very simple genetic algorithm (GA) and iterative ablation. For the GA, we encoded features as bits (acting as switches to specify which features belong to the set). We used mutation only (bit-flip with a probability of .05), a population of 100, and parent selection through a tournament of 3. Search stopped at 100 generations without improvement. The fitness function was the mean of 5 evaluations of the accuracy of the feature set, each with a distinct and randomly selected split of the training / testing data. This eventually resulted in a set of 15 features: {WORD25, TAG, DEP, HWORD25, HWORD50, HWORD100, HPOS, HDEP, IS_ROOT, NER, WORD_BEFORE15, WORD_BEFORE100, WORD_AFTER15, PUNCT_BEFORE, POS_AFTER}.

The iterative ablation procedure starts with the set of all candidate features, and 100 runs of the learning algorithm are performed, again each run randomly split into two-thirds for training and one-third for testing. This provides us with a set of 100 accuracy measurements. The process is then repeated, excluding one feature at a time. The feature that most degrades mean accuracy is excluded. If no feature has a negative impact on accuracy, then the one with the highest p-value (according to the non-parametric Kolmogorov–Smirnov test) above a threshold is excluded. The procedure is repeated, ablating one feature at a time, until no remaining feature fulfills any of the previous two criteria. We performed this procedure with threshold p-values of .05 and .005. The first left us with a set of five features: F5 = {TAG, DEP,
HDEP, HPOS, POS_AFTER}; the second with three features: F3 = {TAG, DEP, HDEP}.

The results of these experiments are shown on the left side of figure 1. As can be seen, all of the three attempts outperform the set of all features. Interestingly, F5 is significantly better than F3, even at p < .005. The accuracy of the GA set falls between that of F3 and F5. We performed these experiments not only as an endeavor to achieve acceptable accuracy for the experiments that follow, but also to obtain empirical evidence regarding the relationship between SH types and traditional linguistic features. We can conclude that SH does not correspond to some trivial mapping of any single linguistic feature.

For subsequent experiments we will use F5, given that it has the best accuracy and still uses a relatively small number of features – something that can make a difference regarding the computational effort needed to parse large quantities of text. It is interesting to notice that F3 still leads to a higher accuracy than the set of all features, and having only three features, such a classifier could be feasibly implemented in a purely programmatic way. A completely human-understandable classification tree could be produced, and also implemented in a very efficient way, sacrificing relatively little in terms of accuracy.

On the right side of figure 1 we present the accuracy of the classifier by text category, using F5. Here, it is interesting to note that the best performing category (fiction) and also one of the second-best (wikipedia, which is not significantly different from news) are out-of-corpus for the training set of the ML model of the underlying linguistic features. It is remarkable that the accuracies that we achieve are comparable and may even surpass the values reported by spaCy (see above). In other words, this suggests that, far from accumulating errors down the stream of the various processing steps, our α stage appears to even correct upstream errors.

It is conceivable that more features become relevant, if a larger number of exotic cases becomes available through larger training corpora. It is also conceivable that larger windows (beyond just previous and next token) become relevant with larger datasets and more sophisticated ML approaches. Such considerations are beyond the scope of this work.

4.2 β-stage

The β-stage transforms the sequence of atoms of the original sentence, each typed by the α-stage, into a semantic hyperedge that reflects the meaning of the sentence and respects the SH syntactic rules. In practice, this operation amounts to a bottom-up process that aggregates the deeper structures of the sentence into increasingly complex hyperedges, by recursively combining them until only a final, well-formed semantic hyperedge is left.

    Function ApplyPattern(seq, pos, pat)
        Data: A sequence of edges seq, a position in the sequence pos and a pattern pat
        Result: A sequence of edges with the initial edges replaced by a single one, if they match the pattern.
        if pat matches seq at pos then
            edge ← reorder matching elements of seq to align with pat
            seq′ ← matching part of seq replaced with edge
            return seq′
        else
            return ∅
        end if
    end

    Function BetaTransformation(seq)
        Data: A sequence of edges seq
        Result: An edge e
        if |seq| = 1 then
            return seq[0]
        end if
        heu_best ← −∞
        seq_best ← ∅
        for pos = 1 to |seq| do
            for pat ∈ Patterns do
                seq′ ← ApplyPattern(seq, pos, pat)
                heu ← h(seq, pos, pat)
                if seq′ ≠ ∅ ∧ heu > heu_best then
                    heu_best ← heu
                    seq_best ← seq′
                end if
            end for
        end for
        if seq_best ≠ ∅ then
            return BetaTransformation(seq_best)
        else
            return ((:/J) + seq[:2]) + seq[2:]
        end if
    end

Algorithm 1: The β transformation recursively applies the patterns from type inference rules until only the final hyperedge is left.

The process for this transformation is formalized in algorithm 1. Let us nonetheless explain in plain words how β iteratively constructs a hyperedge, which need not be a proper semantic hyperedge except at the final step. The process starts indeed with an initial hyperedge as the simple sequence of typed atoms of the original sentence. At each step, the elements of the currently-formed hyperedge are scanned from left to right to look for a sub-sequence of types that matches the list on the left side of the type inference rules of table 2, taken as unordered patterns i.e., up to any reordering. For instance, “capital of Germany” may have been parsed by α as a typed sub-sequence “capital/C, of/B, germany/C”, which then matches the second pattern (B C C). It may then be rearranged as such by putting the connector in first position
Figure 1: Left: accuracy of the α-classifier, comparing several feature sets; all includes all features, GA a features set obtained with a genetic algorithm, F3 is the outcome of iterative ablation with p < .005 and F5 with p < .05. Right: accuracy by source text category using F5. and preserving the order of the remainder of the hyper- the bottom-up process of the β-transformation. Finally, edge i.e., “(of/B capital/C germany/C)”, which conforms if there is still a tie, rules are applied by the order of pri- to the second inference rule of table 2. Note that, in prac- ority expressed in table 2, which is empirically organized tice, we also restrict the second and fifth patterns, i.e. by decreasing order of the depth at which each respec- the builder and conjunction patterns, to the minimum tive structure tends to appear in hyperedges. The special number of two arguments: respectively (B C C) and (J x rule for (+/B) is assigned the highest priority. x 0 ). We find that it fits NL more naturally and thus leads If no sub-sequence matches, the two first items in the to more correct parses. Further tasks of knowledge in- sequence are connected by prepending the special con- ference might later introduce builder- and conjunction- junction (:/J), which is meant to convey the most generic based structures with more arguments. We complement and abstract meaning of “these two things are related in the patterns with one rule that corresponds to the special the most generic sense”. This captures cases often found connector (+/B). This extra rule is admittedly needed to in natural language, such as: “A new era: quantum com- transform implicit builders (C C) into (+/B C C). putation is here.”, which translates to: If only one sub-sequence matches, it is transformed (:/J (a/M (new/M era/C)) (is/P (quantum/M into a sub-hyperedge by application of the rule. 
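The matching and rearrangement step described here (a window of types is compared with a pattern up to reordering, then the connector is moved to first position while the remaining order is preserved) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the representation of the sequence as (edge, type) pairs, the name apply_pattern, the explicit result_type argument and the restriction to fixed-length patterns without the wildcard x are all choices made for this sketch.

```python
from collections import Counter


def apply_pattern(seq, pos, pattern, result_type):
    """Try to collapse the window of `seq` starting at `pos` into one edge.

    seq         -- list of (edge, type) pairs; an edge is a string (atom)
                   or a tuple of edges (representation assumed for this sketch)
    pattern     -- tuple of type codes, connector type first, e.g. ("B", "C", "C")
    result_type -- type code of the edge that the pattern produces
    Returns a new sequence, or None when the pattern does not match.
    """
    window = seq[pos:pos + len(pattern)]
    # Unordered match: the window must carry the same multiset of types.
    if Counter(t for _, t in window) != Counter(pattern):
        return None
    # The connector is the first window element with the connector type.
    conn = next(i for i, (_, t) in enumerate(window) if t == pattern[0])
    # All other elements keep their original relative order.
    args = [e for i, (e, _) in enumerate(window) if i != conn]
    edge = (window[conn][0], *args)
    return seq[:pos] + [(edge, result_type)] + seq[pos + len(pattern):]


typed = [("the/M", "M"), ("capital/C", "C"), ("of/B", "B"), ("germany/C", "C")]
print(apply_pattern(typed, 1, ("B", "C", "C"), "C"))
# [('the/M', 'M'), (('of/B', 'capital/C', 'germany/C'), 'C')]
```

Returning None for a non-match plays the role of the empty result ∅ in algorithm 1; the input sequence is never modified, so several candidate patterns can be tried on the same sequence before one is chosen.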
If two or computation/C) here/C)) more sub-sequences match, the β-stage needs to make a decision on which one to choose and proceed with as If the resulting hyperedge entirely conforms to one of if only one sub-sequence matched. For this case, we use the type inference rules, the process stops successfully as a heuristic function (this is function h in algorithm 1). it managed to form a recursively correct semantic hyper- This heuristic function relies on the grammatical struc- edge. Otherwise, the process is reiterated on the newly- ture of the sentence given by the dependency tree. Our formed hyperedge. The process is thus guaranteed to hypothesis is that grammatically connected edges are converge on a syntactically valid hyperedge, but is of more likely to belong to the same higher-order edge, so course not guaranteed to produce the most desirable or the first criterion of h is to always assign a higher score correct representation. However, we experimentally ver- to sub-sequences where all items are directly connected ify below that, given a correct classification from the α- in the dependency tree. By “directly connected in the stage and a correct dependency parse tree, this process dependency tree”, we mean that all hyperedges contain consistently leads to the construction of a SH that cor- one atom/token that is the head or the child of at least rectly conveys the meaning of the original sentence. one atom/token in another hyperedge, and that any hy- Let us first illustrate the β-stage in figure 2, which pro- peredge can be reached from any other, following such vides one example of an entire parsing process (using grammatical links. In case there is a tie, the heuristic the F3 feature set for simplicity). In figure 2(c), the re- function then prefers the sub-sequence that contains the cursive application of β-transformations to an initial se- deepest atom/token in the dependency tree – again as- quence of atoms can be followed. 
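The stopping condition mentioned above, that the resulting hyperedge conforms to the type inference rules, amounts to checking recursively that a type can be inferred for it. The sketch below illustrates one way to do this for the five rules of table 2. It is an assumption-laden illustration rather than the authors' code: hyperedges are represented as nested Python tuples of atom strings, the function names are hypothetical, and the rule (J x y’+) is read here as "at least two arguments, result has the type of the first argument".

```python
def atom_type(atom):
    """Return the one-letter type code of an atom such as 'of/B.ma', or None."""
    _, sep, code = atom.rpartition("/")
    return code[0] if sep and code else None


def infer_type(edge):
    """Infer the type of a hyperedge (nested tuples of atom strings).

    Returns a type code, or None when no rule of table 2 applies, i.e.
    when the hyperedge is not well-formed under this sketch's reading.
    """
    if isinstance(edge, str):
        return atom_type(edge)
    if not edge:
        return None
    # Types are inferred bottom-up: connector first, then each argument.
    conn, *args = (infer_type(e) for e in edge)
    if conn is None or None in args:
        return None
    if conn == "M" and len(args) == 1:                                   # (M x) -> x
        return args[0]
    if conn == "B" and len(args) >= 2 and all(a == "C" for a in args):   # (B C C+) -> C
        return "C"
    if conn == "T" and len(args) == 1 and args[0] in ("C", "R"):         # (T [CR]) -> S
        return "S"
    if conn == "P" and args and all(a in ("C", "R", "S") for a in args): # (P [CRS]+) -> R
        return "R"
    if conn == "J" and len(args) >= 2:                                   # (J x y'+) -> x (assumed reading)
        return args[0]
    return None


print(infer_type(("of/B", "capital/C", "germany/C")))            # C
print(infer_type((("not/M", "is/P"), "berlin/C", "nice/C")))     # R
print(infer_type(("of/B", "capital/C")))                         # None
```

Because the connector itself is typed recursively, a modified predicate such as (not/M is/P) is handled by the same code path as an atomic one.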
In the first step, we suming a correlation with SH depth, and thus respecting can see that the sequence (the/M, capital/C) matches the 11
Figure 2: (a) Dependency parse tree with dependency labels (green) and fine grained part-of-speech tags (red). (b) α-stage classification of atom types. (c) β-stage structuring of sentence by iterative application of the patterns from table 2. A non-selected pattern is greyed-out. pattern (M C), and the sequence (capital/C, of/B, ger- year/C) or (multi/M year/C) would be much preferable. many/C) matches the pattern (B C C+). We thus rely on However, this partially defective parse is still likely to be the above-mentioned heuristic function, which causes useful in the methods that we will discuss in the follow- (of/B capital/C germany/C) to be preferred to (the/M cap- ing sections. We also see how different type assignments ital/C). The reader can verify that selecting the latter at of the α-classifier can result, in practice, in correct hyper- this stage would lead to a dead-end. The rest of the SH edges at the end. We can also use this example to illus- construction is straightforward. trate another metric that we employ in this evaluation: the relative defect size. This is simple the ratio of the size Argument roles. Now that the core of the translation of of the defective part to the size of the entire hyperedge. NL into SH has been specified, assigning the argument Size is measured in total number of atoms (at any depth). roles introduced in Section 3.4 amounts to a trivial trans- A wrong hyperedge is one where the meaning of the lation from the dependency labels. Sometimes however, sentence is completely lost. For example, consider what the parser may fail to determine an argument’s role, and would happen if, in the above case, “stressed” was classi- thus classify it as unknown (that we code “?” for this pur- fied as a concept instead of a predicate. This also serves pose). to illustrate that there is a complex relationship between α-classifier accuracy and overall parser accuracy. 
Some 4.3 Validation of α and β mis-classifications at the α-stage can still allow for a completely correct parse, while others can lead to catas- To test the accuracy of the complete translation from trophic failures or just minor defects. Nonetheless, we NL to SH, we randomly selected 100 new sentences for observe on this sample of 500 sentences that a correct each text category, that were used neither for training α classification and dependency parse tree always lead nor testing of the α-classifier. We establish three cate- to the construction of an SH that preserves the meaning gories: completely correct hyperedges, hyperedges with of the sentence. By contrast, a badly-structured depen- some defect and completely wrong hyperedges. A hyper- dency tree appears to have a significant negative impact edge is considered to have a defect if overall meaning is on the functioning of β, through the heuristic function. preserved, but some subedge contains a defect. Let us If this result generalizes, this suggests that, for a given consider a real example from our dataset. The sentence: accuracy of the dependency parsing module, increasing “The scientists – who are part of a multi-year Interna- the quality of the NL to SH translation principally relies tional Shelf Study Expedition – stressed their findings are on improving α and the heuristic function. preliminary.” was parsed as: We show the results of this evaluation in table 4. It (stressed/P (:/J (the/M scientists/C) (are/P who/C (of/B is interesting to notice that “non-fiction” is one of the part/C (a/M (+/B (+/B (+/B multi/C -/C) year/C) worst performing categories in the α-classifier, but ends (+/B international/C (+/B shelf/C (+/B study/C up being the best one overall. Likewise, “fiction” is the expedition/C)))))))) (are/P (their/M findings/C) best category at α-stage but ends up being the second preliminary/C)) worst here. 
Unsurprisingly, “fiction” sentences tend to be richer in figures of speech and other complexities and The hyperedge preserves most of the meaning of the ambiguities that lead to a higher rate of catastrophic fail- sentence, but the concept (+/B (+/B multi/C -/C) ure. Conversely, “non-fiction” is the category with the year/C) is not correctly formed. Either (-/B multi/C most straight-forward sentences. In the “science” cate- 12