Knowledge Graphs versus Property Graphs: Similarities, Differences and Some Guidance on Capabilities - TopQuadrant
Knowledge Graphs versus Property Graphs: Similarities, Differences and Some Guidance on Capabilities
We are in the era of graphs. Graphs are hot. Why? Flexibility is one strong driver: heterogeneous data, integrating new data sources, and analytics all require flexibility. Graphs deliver it in spades.

Over the last few years, a number of new graph databases came to market. As we start the next decade, dare we say “the semantic twenties,” we also see vendors that never before mentioned graphs starting to position their products and solutions as graphs or graph-based.

Graph databases are one thing, but “Knowledge Graphs” are an even hotter topic. TopBraid EDG is a solution for creating Knowledge Graphs and putting them to work. (See the section “TopBraid EDG: An Enterprise Knowledge Graph Infrastructure for Data Governance” for more information.) As a result, we are often asked to explain Knowledge Graphs.
• What are they?
• Why and where are they useful?
• How are they different from “just graphs?”

At the recent Data Governance Vision conference, we gave a talk on the topic of supporting Data Governance using Knowledge Graphs. One of the questions asked at the end of the talk was whether we were using Microsoft’s SQL Graph, and if not, then why not. After answering the question there on the fly, we decided that it was time to write a short paper explaining the differences between distinct implementations of graphs.

Today, there are two main graph data models:
• Property Graphs (also known as Labeled Property Graphs)
• RDF Graphs (Resource Description Framework)

Other graph data models are possible as well, but over 90% of the implementations use one of these two models. We will start by describing each of them.

Graph Data Models: Property Graphs and RDF Graphs

When we say that over 90% of implementations use either Property Graphs or RDF Graphs, we mean implementations that use some kind of an industry-recognized graph data model. Due to the current expansive popularity of graphs, many vendors are starting to represent their technology as graph based, when in reality they use a home-grown object repository that can resemble certain aspects of graphs.

This white paper is not intended to cover such implementations since they do not use a recognized data model and, thus, there is no basis for comparison. If you are considering a technology that claims to be graph based, our recommendation is to always ask what graph data model it uses.
Property Graphs

While there are core commonalities in property graph implementations, there is no true standard property graph data model.

Each implementation of a Property Graph is, therefore, somewhat different. In the following, we will focus our discussion on the characteristics that are common for any property graph database.

The most well-known implementation, which popularized property graphs as a concept, is the Neo4j graph database. At minimum, everything stated here is true for Neo4j.

Other examples of property graph implementations are TigerGraph and Titan. MS SQL Graph is based on the same underlying concept, but it currently offers more limited capabilities than either Neo4j or some of the other products that are using the property graph data model.

Generally, the property graph data model consists of three elements:
• Nodes are the entities in the graph. Nodes can be tagged with zero to many text labels representing their type. Nodes are also called vertices.
• Edges are the directed links between nodes. Edges are also called relationships. The “from node” of a relationship is called the source node. The “to node” is called the target node. Each edge has a type. While edges are directed, they can be navigated and queried in either direction.
• Properties are the key-value pairs associated with a node or with an edge.

If you have worked with object databases, you will find it easy to understand the Property Graph data model. It is really more of an object data model than a graph data model.
• Nodes are entities
• Edges are relationships
• Properties are attributes
Both entities and relationships can have attributes.

Property values can have data types. Supported data types depend on the vendor. For example, Neo4j data types are similar, but not identical, to Java language data types.

Figure 1 shows a fragment of a property graph with data about actors, directors and films or TV programs they worked on. Nodes are represented as ovals. For example, the node with ID 123, as we can see from its properties, represents Tom Hanks. Node labels are shown in dark blue. Node 123’s labels are Person, Actor and Director.
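The three elements described above can be sketched in a few lines of Python. This is only an illustration of the data model, not how Neo4j or any other product stores data; the class names, the helper functions and the edge ID used here are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int                                         # internal ID assigned by the database
    labels: set = field(default_factory=set)        # zero to many text labels
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Edge:
    id: int                                         # illustrative internal ID
    type: str                                       # each edge has exactly one type
    source: int                                     # ID of the "from" node
    target: int                                     # ID of the "to" node
    properties: dict = field(default_factory=dict)  # edges can carry properties too


nodes = {
    123: Node(123, {"Person", "Actor", "Director"},
              {"First Name": "Tom", "Last Name": "Hanks", "Year Born": 1956}),
    125: Node(125, {"Movie"}, {"Name": "The Post", "Released": 2017}),
}
edges = [Edge(14, "ACTED_IN", 123, 125, {"Role": "Ben Bradlee"})]


def outgoing(node_id, edge_type):
    """Follow edges of the given type from source to target."""
    return [nodes[e.target] for e in edges
            if e.source == node_id and e.type == edge_type]


def incoming(node_id, edge_type):
    """Follow the same directed edges in the opposite direction."""
    return [nodes[e.source] for e in edges
            if e.target == node_id and e.type == edge_type]
```

Note that labels and edge types are plain strings here, and that properties live in a dictionary separate from the node-and-edge structure. Both points matter for the comparison with RDF later in the paper.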
[Diagram: A property graph fragment with data about actors, directors, and films or TV programs. Legend: node, node label, property key-value pairs, edge, edge type.]
Figure 1: Simple Property graph excerpt with information about people and works of art
Relationships are depicted as grey arrows. Each relationship has a single type that is shown in red. Properties are shown in the rounded rectangles with the gold background. Properties are connected to the nodes and relationships that they belong to using red arrows.

A key part of any data model is having a query language available for working with it. After all, users need to have a way to access and manipulate the data in the graph. At the time of writing, no industry-standard query language exists for property graphs. Instead, each database offers its own, unique query language that is incompatible with others:
• Neo4j offers Cypher, also known as CQL: its own query language that, to some extent, took SQL as an inspiration;
• TigerGraph offers GSQL: its own query language that also took SQL as an inspiration;
• MS SQL Graph has its own extension to SQL to support graph query;
• Some vendors, in addition to their own query language, also implement some subset of Cypher. For example, SAP HANA offers its own extensions to SQL and its own GraphScript language, plus it supports a subset of Cypher.

There is also Apache TinkerPop, an open source graph computing framework that is integrated with some property graph and RDF graph databases. It offers the Gremlin language, which is more of an API language than a query language.

A key requirement for working with any data model is the ability to reference nodes, properties and relationships (edges). In the case of property graphs, internally, nodes and edges have IDs. IDs are assigned by a database and are internal to a database. Referencing is done by using text strings: node labels, relationship types, and property names.

The fastest way to load bulk data is by importing a text file. For property graph data, there is no standard serialization (a way to represent graph data as a text file). It is typical for a property graph vendor to define a CSV format that users should follow in order to prepare files for bulk load.
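To make the bulk-load point concrete, here is a minimal sketch of reading a node file in a made-up CSV layout. The column names and the `;` separator for labels are assumptions for this example only; every vendor defines its own layout, which is exactly why such files are not portable between products.

```python
import csv
import io

# Hypothetical vendor-style node file: an "id" column, a "labels" column
# (labels separated by ";"), and one column per property key.
NODES_CSV = """id,labels,Name,Released
124,TV Series,A League of Their Own,1993
125,Movie,The Post,2017
"""


def load_nodes(text):
    """Parse the hypothetical CSV layout into {id: {"labels", "properties"}}."""
    loaded = {}
    for row in csv.DictReader(io.StringIO(text)):
        node_id = int(row.pop("id"))
        labels = set(row.pop("labels").split(";"))
        # Whatever columns remain are treated as property key-value pairs.
        # CSV carries no type information, so every value arrives as a string.
        loaded[node_id] = {"labels": labels, "properties": dict(row)}
    return loaded


nodes = load_nodes(NODES_CSV)

# Referencing is done with text strings: here, the label "Movie".
movies = [n["properties"]["Name"] for n in nodes.values() if "Movie" in n["labels"]]
```

In this sketch the release year is loaded as the string "2017"; a real loader would also need vendor-specific rules for data types.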
RDF Graphs

RDF graphs use a standard graph data model. The standard for the RDF technology stack is managed by the World Wide Web Consortium (W3C), the same standards body that manages HTML, XML and many other web standards. Every database that supports RDF is expected to support the model in the same way.

The RDF graph data model basically consists of two elements:
• Nodes, the vertices in a graph. Nodes can be resources with unique identifiers or they can be “literals” with values that are strings, integers, etc.
• Edges, the directed links between nodes. Edges are also called predicates and/or properties. The “from node” of an edge is called the subject. The “to node” is called the object. Two nodes connected by an edge form a subject-predicate-object statement, also known as a Triple or a Triple Statement. While edges are directed, they can be navigated and queried in either direction.

Everything in an RDF graph is called a resource. “Edge” and “Node” are just the roles played by a resource in a given statement. Fundamentally in RDF, there is no difference between resources playing an edge role and resources playing a node role. An edge in one statement can be a node in another. The diagrams that follow give examples that will make this core idea clearer.

There is a standard query language for RDF Graphs called SPARQL. It is both a full-featured query language and an HTTP protocol, making it possible to send query requests to endpoints over HTTP.

A key part of the RDF standard is the definition of serializations. The most commonly used serialization format is called Turtle. There is also a JSON serialization called JSON-LD as well as an XML serialization. All RDF databases are able to export and import graph content in standard serializations, making it easy and seamless to interchange data.

Built-in Semantics

The RDF Data Model provides a richer, semantically consistent foundation over property graphs. Let’s see how a graph we showed earlier (Figure 1) is represented as an RDF Graph (Figure 2).

Note that the diagrams depict relationships using the recommended conventions of the property graph and RDF graph communities. Relationships in Property Graphs are typically capitalized with multiple words joined together by an underscore as in ACTED_IN. Relationships (or any property) in RDF graphs are typically identified using the lower camel case convention as in ex:actedIn. In both cases, these are simply recommended practices, not a “must have.”
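The subject-predicate-object model described in this section can be sketched with plain Python tuples. This is a simplification for illustration: a real RDF store distinguishes URIs from literals and would be queried with SPARQL, whereas the `match` helper below is a hypothetical stand-in for a single triple pattern.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("wikidata:Q2263", "rdf:type", "ex:Actor"),
    ("ex:125", "rdfs:label", "The Post"),
    # The predicate ex:actedIn is itself a resource, so it can be a subject.
    ("ex:actedIn", "rdfs:label", "ACTED IN"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


# Forward navigation: what did this resource act in?
forward = match(s="wikidata:Q2263", p="ex:actedIn")
# Reverse navigation over the same directed edge: who acted in ex:125?
reverse = match(p="ex:actedIn", o="ex:125")
# An edge in one statement is a node (the subject) in another.
about_predicate = match(s="ex:actedIn")
```

The last query is the point of the section: the same lookup works whether the resource plays an edge role or a node role.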
The graph in Figure 2 appears larger than the property graph in Figure 1 because all literal values are also depicted as nodes in the graph. All nodes are depicted as rounded rectangles with the light yellow background.

When visualizing RDF Graph data, it is common not to show literal values as nodes in order to make a cleaner and simpler looking diagram. That said, from the data structure perspective, they are part of the graph just like any other node. The only difference is that they can’t serve as a source node, i.e., a subject of a statement. They can only be targets or objects. Throughout this paper, we will continue to show them in the diagrams as nodes.

Although this makes the diagrams larger and busier, we believe it helps to illustrate the differences between the two data models and the implications of these differences on knowledge capture, graph design and graph evolution.

Literal values in an RDF Graph can have datatypes. The datatypes are taken from XML Schema (e.g., xsd:string, xsd:integer, etc.). Text values can also have language tags to support internationalization of data. For example, instead of a single value for rdfs:label for New York City we could have multiple values such as:
• “New York City”@en
• “Nueva York”@es

Identifier is a very important concept for RDF graphs. Every non-literal node is assigned an identifier, typically a URI/IRI. Local, non-URI identifiers are possible, but rarely used because they are not interoperable. Globally unique identifiers bring many benefits to graph data models. An RDF-based solution can auto-generate URIs based on selected URI construction rules. Alternatively, when adding data (e.g., loading a serialized file), users can provide URIs that they want to use.

The URIs identifying nodes are displayed in the diagram using qualified names, commonly called QName notation. To form a QName, the namespace part of the URI is abbreviated using a prefix. For example, “rdf:” and “rdfs:” represent the built-in standard namespaces http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://www.w3.org/2000/01/rdf-schema#, respectively.

These namespaces define the semantics of (the model behind) the RDF Data Model. The built-in resources such as rdf:type carry semantics that are defined in the standard. The built-in resources can be used as either nodes or edges in a graph. For an example of such semantics in edges, see the predicates (aka properties) rdf:type and rdfs:label in the RDF graph diagram in Figure 2. For an example of such semantics in nodes, see the node rdfs:Class that is the object of the rdf:type predicate in the diagram shown in Figure 3.
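The QName abbreviation described above amounts to a prefix lookup. The following is a minimal sketch; the rdf: and rdfs: namespaces are the standard ones, while the namespace bound to ex: is an assumption for the example, since the paper does not state it.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/",  # assumed namespace, for illustration only
}


def expand(qname):
    """Turn a qualified name such as rdf:type into a full URI."""
    prefix, local_name = qname.split(":", 1)
    return PREFIXES[prefix] + local_name


def shorten(uri):
    """Abbreviate a full URI using the first prefix whose namespace matches."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known prefix: leave the URI as it is


full = expand("rdf:type")
```

Because the full URI is globally unique, two graphs that both mention rdf:type refer to the same resource no matter which prefix each file happens to declare.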
[Diagram: An RDF graph with the same data about actors, directors, films or TV programs.]
Figure 2: An RDF graph representing the information in the Property graph in Figure 1
A key differentiator that we will be introducing is how the underlying model (schema) is represented in the same way as the data. Just to serve as a primer, “rdf:type” is a predicate used to connect a resource with a class it belongs to; “rdfs:label” is used to provide a display name for a resource.

The uniformity of the data model makes RDF Graphs more easily evolvable and gives them more flexibility compared to Property Graphs. We will see examples of this later in the white paper.

Enrichment through Composition

With the inherent composability of RDF Graphs, when two nodes have the same URI, they are automatically merged. This means that you can load different files and their content will be joined together, forming a larger and more interesting graph.

Examples of composability can be found in the use of schema.org and Wikidata. Schema.org is a namespace jointly set up by Google, Bing and Yahoo to create and support a common set of schemas for structured data markup. The prefix ‘schema:’ stands for schema.org. It provides a number of predicates and classes with commonly agreed and understood semantics. In the example, we are using schema:givenName, schema:familyName and schema:City. Similarly, ‘wikidata:’ is the namespace of Wikidata, which provides structured data from Wikipedia and its sister projects in a knowledge graph format. In this way, graphs developed by different organizations can link and share common semantics.

When organizations create their own knowledge graphs, they may use URIs of community-defined resources as well as create resources for which they “mint” their own URIs. In the latter case, they would normally use a web domain they own as a namespace, because a reference to a resource in an RDF Graph is expected to resolve and return information about it. In our example, in addition to using URIs from RDF, RDFS, Wikidata and Schema.org, we are also demonstrating the use of our own URIs. These URIs have the ‘ex:’ prefix to illustrate that they are provided as an example.

For human users browsing data, a reference to a resource URI will typically return information about a resource presented as a web page. For APIs making a call, information can be returned in JSON, any standard serialization of RDF or any other machine-processable format.

The part of the QName after the prefix is called a local name. A local name could be formed by using a display label if it can uniquely identify a resource within a namespace and is considered immutable. It could also be formed using a counter, much like in relational databases, where a record gets the next sequential number as its ID. It could also be formed using a machine-generated random ID or be based on the value of one or more predicates that can establish a locally unique identity.
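The composition behavior described under “Enrichment through Composition” can be sketched with set union over triples. The two “files” below are hypothetical; what matters is that both use the same identifier for the same resource, so nothing has to be matched or reconciled when they are combined.

```python
# Content of one hypothetical file: names, using schema.org predicates.
file_a = {
    ("wikidata:Q2263", "schema:givenName", "Tom"),
    ("wikidata:Q2263", "schema:familyName", "Hanks"),
}

# Content of another hypothetical file: filmography, using ex: predicates.
file_b = {
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    ("ex:125", "rdfs:label", "The Post"),
}

# Loading both files is a set union. Statements about the same identifier
# end up attached to the same node without any extra join logic.
graph = file_a | file_b


def describe(resource):
    """All predicate-object pairs known for a resource in the merged graph."""
    return {(p, o) for s, p, o in graph if s == resource}


merged_view = describe("wikidata:Q2263")
```

After the union, the resource carries three statements, two from the first file and one from the second.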
TopBraid EDG: An Enterprise Knowledge Graph Infrastructure for Data Governance

• TopBraid EDG is a rich set of interconnected Knowledge Graphs expressing knowledge about how data is used and managed in the enterprise ecosystem.
• These integrated Knowledge Graphs are ready to be enriched with your enterprise-specific knowledge.
• When this enrichment takes place, your enterprise is ready for implementing comprehensive Data Governance.

A knowledge graph contains facts about entities in the world together with the meaning of those facts expressed as models and rules.
• FACTS: James has blue eyes. James’ father is Andrew. James is a person.
• MODELS: A person has eye color. A person has two parents. A person’s father is also a person and he is male.
• RULES: If both of a person’s parents have blue eyes, they will also have blue eyes.
Differences in Terminology and Capability

Certain key terms used when describing graphs actually mean very different things depending on the graph data model one talks about. This is important to understand to avoid confusion. It is also important to understand in order to appreciate differences in the capabilities that these two graph data models provide.

We will now describe the differences in the meaning and use of some key concepts:
• LABELS
• TYPES
• PROPERTIES

What are Labels and Types

In RDF Graphs, a label is a standard predicate defined in the RDFS namespace: rdfs:label. It is used to point to the value of a display name for any resource. For example, the label for resource wikidata:Q6 in the graph shown in Figure 2 is “New York City.” You could also use another predicate for this purpose, but rdfs:label is widely accepted as a unique identifier of a property that connects a resource to its display name. In Property Graphs it is typical to create a property called “name” and use it to hold a display name for a node. You could also use a differently named property.

In Property Graphs, the term “label” is used to identify the type of a node. It is called a label rather than a type because it is simply a string, a textual tag. It has no meaning beyond the text. No information about it can be captured in a graph. Edges in a Property Graph also have a tag that identifies the type of an edge. It is called a “type” or, sometimes, “relationship type”. It is used in queries when matching relationships, and it is also used as a display name for edges when graphs are shown visually.

In contrast, in RDF Graphs, the type of a node or of a property is a resource, i.e., another node in the graph, typically with additional information associated with it to define its intended use and semantics. A node is connected to its type using the rdf:type predicate.

Note that some Property Graph databases (e.g., SAP HANA) do not use the term “label” at all and instead use the term “type” or “node type.” The underlying implementation, however, is the same: a type is a tag for a node or a tag for a property. It is not a node itself.

Let’s take a look, in Figure 3, at a fragment of the same RDF graph we showed in Figure 2, now expanded with more information about types or classes and other schema elements.

The green border around nodes or edges indicates graph elements that describe the data model. In RDF, as in Property Graphs, nodes can belong to more than one set (class). We see this with Actor and Director. Tom Hanks is both. However, if one of the classes is a subclass of another, there is no need in RDF to specify a “parent type.” Instead, this information is provided at the class level for all resources that belong to a class, because class information is also a part of the RDF graph.
[Diagram: Modeling information, represented the same way as facts, can expand an RDF graph.]
Figure 3: Part of the RDF graph diagram of Figure 2 expanded with modeling information
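The way class-level information removes the need to repeat “parent types” on every node can be sketched as follows. The function names are hypothetical, and the code applies only the rdfs:subClassOf semantics discussed here rather than a full RDFS reasoner; the class hierarchy follows the one drawn in Figure 3.

```python
triples = {
    # Model: stated once, at the class level.
    ("ex:Actor", "rdfs:subClassOf", "schema:Person"),
    ("ex:Director", "rdfs:subClassOf", "schema:Person"),
    ("schema:City", "rdfs:subClassOf", "schema:AdministrativeArea"),
    ("schema:AdministrativeArea", "rdfs:subClassOf", "schema:Place"),
    # Facts: only the most specific types are asserted.
    ("wikidata:Q2263", "rdf:type", "ex:Actor"),
    ("wikidata:Q2263", "rdf:type", "ex:Director"),
    ("wikidata:Q462177", "rdf:type", "schema:City"),
}


def superclasses(cls):
    """All classes reachable from cls by following rdfs:subClassOf."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(resource):
    """Asserted types plus every type implied by the class hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls)
    return direct | inferred
```

Here `types_of("wikidata:Q2263")` includes schema:Person even though no statement says so directly, because the model and the facts sit in the same set of triples.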
For example, unlike the Property Graph in Figure 1, we do not say in Figure 2 that Tom Hanks is a person in addition to being an actor and a director or that Sarah Paulson is a person in addition to being an actor. We simply say that there is an rdfs:subClassOf relationship between the class of Actors and the class of People. And the same for the class of Directors. The semantics of rdf:type and rdfs:subClassOf are defined in the standard: the graph depicted in Figure 3 says that every resource of type Actor is also of type Person.

We also do not say that the type of New York City or White Plains is a place (location) in addition to a city. We do not need to repeat this fact for each city. We already said it in the model: each city is also a place and what is defined for a place will apply to a city.

In an RDF Graph, we can capture any information about the model of the data that is stored in a graph. This information will be stored, accessed and processed the same way as any other data. For example, the graph diagram in Figure 3 shows that we can add a label to the predicate ex:actedIn. Similarly, we could also say that when the relationship ex:actedIn is used to navigate in the opposite direction (from a movie to an actor), the display name of the relationship should be shown as ‘actors’. In an RDF Graph, a resource that is used as a predicate in one statement can be used as a subject or object in another statement. This is an example of the additional flexibility that, among other things, lets us store information about predicates and their usage. The edges in Property Graphs offer nothing comparable.

We can extend the RDF graph further to explicitly define how a predicate should be used. For example, we could say that any resource of type schema:CreativeWork can have a property ex:released and the value of that property must be a date. This would apply to a Movie or a TVSeries since they both are subclasses of schema:CreativeWork. The diagram in Figure 4 shows what this looks like in a graph.

In Figure 4, the sh: prefix (e.g., in sh:property) stands for http://www.w3.org/ns/shacl#, the standard namespace that is used for SHACL, a language for defining rules and constraints for RDF Graphs, turning them into fully fledged Knowledge Graphs. SHACL offers a very strong approach to ensuring the integrity of RDF data and more.

For instance, we can:
• Consult a graph to find out what properties are appropriate for, let’s say, a movie and what are the valid values for these properties.
• Define constraints, also known as rich data quality/validity rules. For example, as shown in Figure 4, we have defined a minimum allowed date value for the ‘released’ property of a creative work (e.g., a movie or a TV Series). Now, if a movie released prior to 1900 is added to a graph, the graph can identify it as a problem. While this example is simple, we can add to the graph much more sophisticated rules. For instance, we could specify copyright regulations that must be in place for resources released or published after a certain date.
• Define rich inference rules. Inference rules generate new facts from the facts in the graph.

These key capabilities turn RDF Graphs into Knowledge Graphs.

What are Properties

In RDF Graphs, an edge is called a property (predicate) and an object that a property points to may be called a property value. All property values (literals and URIs alike) are stored as nodes. For example, as shown in Figure 2:
• The rdfs:label for the resource ex:125 is “The Post.” In this example, rdfs:label is a property and “The Post” is a value.
• The edge ex:filmedIn is also a property. Its values for ex:125 are wikidata:Q6 and wikidata:Q462177.
• In “data modeling speak,” in an RDF Graph properties can be either attributes or relationships.

In Property Graphs, properties can only have literal values. These are stored and treated differently from the nodes in a graph. In data modeling speak, properties in a Property Graph are always attributes. This is why property graphs are formally described as directed, edge-labeled, attributed graphs.

[Diagram: a property shape attached to schema:CreativeWork that constrains the datatype and minimum value of ex:released; schema:TVSeries and schema:Movie are subclasses of schema:CreativeWork.]
Figure 4: Extending an RDF graph with more modeling information about the ex:released property
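The constraint illustrated in Figure 4 can be approximated in plain Python. This is not SHACL and does not use a SHACL engine; it is a hand-written stand-in that applies the same idea (a datatype and a minimum value for ex:released, declared once on schema:CreativeWork and applied to its subclasses). The dictionary layout, the function names and the resource ex:999 are assumptions for this sketch.

```python
from datetime import date

# Model: class hierarchy and one constraint, declared on the parent class.
subclass_of = {
    "schema:Movie": "schema:CreativeWork",
    "schema:TVSeries": "schema:CreativeWork",
}
constraint = {
    "target": "schema:CreativeWork",
    "path": "ex:released",
    "datatype": date,
    "min": date(1900, 1, 1),
}

# Facts. The paper gives only the year 2017 for The Post; month and day
# below are placeholders. ex:999 is a hypothetical resource that violates
# the constraint.
resources = {
    "ex:125": {"type": "schema:Movie", "ex:released": date(2017, 1, 1)},
    "ex:999": {"type": "schema:Movie", "ex:released": date(1895, 1, 1)},
}


def is_a(cls, target):
    """True if cls equals target or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = subclass_of.get(cls)
    return False


def violations():
    """Check every resource whose class falls under the constraint target."""
    problems = []
    for uri, data in sorted(resources.items()):
        if not is_a(data["type"], constraint["target"]):
            continue
        value = data.get(constraint["path"])
        if value is None:
            continue  # the property is optional in this sketch
        if not isinstance(value, constraint["datatype"]):
            problems.append((uri, "wrong datatype"))
        elif value < constraint["min"]:
            problems.append((uri, "before minimum"))
    return problems
```

The constraint is never restated for schema:Movie; it reaches movies through the subclass link, which mirrors how the paper describes model information applying to every member of a class.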
A property structure is that of key-value pairs. This means that a property key can only have a single value. If it has more than one value, then the single value is turned into an array of comma-separated values. For an example, see Figure 5.

[Diagram: the White Plains node (ID 127, labels Location and City) with the “Population” property holding the two values 58811 and 56853.]
Figure 5: In Property graphs, the property structure is that of key-value pairs; multiple values must be turned into an array of comma-separated values

Turning multi-valued properties into arrays makes it harder to efficiently answer queries such as “all cities with population over 58,000.” The first value in the array is the population of White Plains in 2018. The second value is the population of White Plains in 2010. There is no way in a Property Graph to capture what each of these values represents beyond the fact that the key part of the key-value pair is Population. This brings us to the next important difference: how to capture additional information about a property value. In saying this, we mean any property, whether it is an attribute or a relationship. As we see with the population example, it may be important to qualify a measurement by the date it was measured on. There are also other important information qualifiers, including source and confidence.

For example, Wikidata captures many details about the source of the information about Tom Hanks’ birth date in order to give users confidence in the reliability of the data. As shown in Figure 6, it got the information from 9 sources which all agree on the date. The sources include the Encyclopedia Britannica, Internet Broadway Database and others.

Figure 6: A screenshot from Wikidata showing the sources of information about Tom Hanks’ birth date

Differences in Attaching Information about an Edge

In RDF Graphs, unlike in Property Graphs, edges are typically re-used:
• In the Property Graph shown in Figure 1, there are two ACTED_IN edges with different IDs: an edge connecting the node representing Tom Hanks to the node representing the movie The Post and an edge connecting Sarah Paulson to this movie. The two edges have the same type, but different identity.
• In RDF, it is the same edge. This means that if you need to say something about a relationship between Tom Hanks and The Post (e.g., the role he played in the movie), you can’t simply add a statement to the ex:actedIn property. If you do this, it will apply everywhere this property is used.

In other words, in the Property Graph data model, edges uniquely identify the combination of source node, edge and target node. In the RDF data model, they tend not to. Of course, one could create a unique edge and simply give it the type ex:actedIn. However, this is normally not done because RDF databases are optimized for working with edges that represent types instead of occurrences of types.

To support the need to attach information on an edge between two specific nodes, RDF provides a way to create a new node that uniquely identifies the source-edge-target combination (the subject-predicate-object triple in RDF speak). With that in place, we can make statements about the new node using the regular approach: it can be a subject or an object of any statement. This is shown in Figure 7, where we created a new node ex:126 to represent the statement (triple) of Tom Hanks’ acting in The Post. The new node is connected to the statement about Tom’s acting in The Post using rdf:subject, rdf:predicate, rdf:object and rdf:Statement, built-in elements of the RDF data model that support this use case.
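The statement-about-a-statement pattern described above can be sketched with the same tuple representation used for ordinary triples. The node ex:126 and the ex:role predicate come from the paper’s Figure 7; the helper function is a hypothetical illustration, not a standard API.

```python
triples = {
    # The original fact.
    ("wikidata:Q2263", "ex:actedIn", "ex:125"),
    # A new node that identifies that specific subject-predicate-object triple.
    ("ex:126", "rdf:type", "rdf:Statement"),
    ("ex:126", "rdf:subject", "wikidata:Q2263"),
    ("ex:126", "rdf:predicate", "ex:actedIn"),
    ("ex:126", "rdf:object", "ex:125"),
    # Information attached to the statement rather than to ex:actedIn itself.
    ("ex:126", "ex:role", "Ben Bradlee"),
}


def facts_about_statement(s, p, o, prop):
    """Values of prop attached to the node that represents triple (s, p, o)."""
    values = []
    statement_nodes = {n for n, pred, cls in triples
                       if pred == "rdf:type" and cls == "rdf:Statement"}
    for node in statement_nodes:
        if ((node, "rdf:subject", s) in triples
                and (node, "rdf:predicate", p) in triples
                and (node, "rdf:object", o) in triples):
            values += [v for n, pred, v in triples if n == node and pred == prop]
    return sorted(values)


role = facts_about_statement("wikidata:Q2263", "ex:actedIn", "ex:125", "ex:role")
```

The role is attached to one occurrence of ex:actedIn, so it does not leak onto other uses of the same predicate. The cost is the four extra triples needed to describe the statement node.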
[Diagram: RDF graph with an example of making a statement about another statement.]
Figure 7: RDF graph showing making a statement about another statement, to attach information on an edge between two specific nodes.

Compared to Property Graphs, this approach is more powerful and flexible because it supports:
• Adding other edges (relationships) to edges. For example, instead of having a role as a string, we may want to have a connection to a node representing Ben Bradlee, a person. This is fundamentally not possible with Property Graphs without changing (restructuring) the original graph.
• Adding more information to any property, not just a relationship. For example, we can use it to specify the effective date of each population measurement for White Plains. This is also not possible with Property Graphs.

For Property Graphs, the solution to the need to add edges to other edges is to create intermediate nodes, as shown in Figure 8. This requires restructuring of a graph and changing all queries and logic because the path between actors and movies is now different (compare with the original graph in Figure 1).

[Diagram: In property graphs, adding edges to other edges requires refactoring the graph. Legend: node, node label, property key-value pairs, edge, edge type.]
Figure 8: Refactored Property Graph with Ben Bradlee as a Person

With RDF, you do not need to make changes to the graph structure to make a link to the resource representing Ben Bradlee. You simply change the node at the end of the ex:role relationship from a string to a URI. This is demonstrated in Figure 9. The approach is evolutionary and does not require any refactoring other than the change of the value itself.

There may, however, be some other situations where you would want to introduce new intermediate nodes. If you do so, SHACL rules can be used to deliver the original relationship path, inferring its value from the new, more complex path. In this way, your existing queries and programs can remain the same.

The Property Graph solution to adding more information to a property (e.g., population) is to change the structure of the graph to turn a property into an edge and a value into a node. This requires restructuring of a graph and changes to all queries and logic for its processing because the storage and access of properties is fundamentally different and separate from the graph traversal. This makes Property Graphs less evolvable or flexible than RDF Graphs.
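The evolutionary RDF change described above, replacing the string at the end of ex:role with a URI, can be sketched as follows. The identifier ex:BenBradlee is a hypothetical URI minted for this example, and the tuple representation does not distinguish literals from URIs the way a real RDF store would.

```python
triples = {("ex:126", "ex:role", "Ben Bradlee")}


def roles(statement_node):
    """The query stays the same before and after the change."""
    return [o for s, p, o in sorted(triples)
            if s == statement_node and p == "ex:role"]


before = roles("ex:126")

# Evolve the graph: swap the literal for a resource and describe that resource.
triples.remove(("ex:126", "ex:role", "Ben Bradlee"))
triples |= {
    ("ex:126", "ex:role", "ex:BenBradlee"),          # hypothetical URI
    ("ex:BenBradlee", "rdf:type", "schema:Person"),
    ("ex:BenBradlee", "rdfs:label", "Ben Bradlee"),
}

after = roles("ex:126")
```

The path from the statement node to the role is still a single ex:role edge, so `roles` did not have to be rewritten. Only the kind of value it returns has changed.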
Flexibility is acknowledged as the key differentiating advantage of graph databases. For example, the leading vendor of property graph databases says, “With graph databases, IT and data architect teams move at the speed of business because the structure and schema of a graph model flexes as applications and industries change. Rather than exhaustively modeling a domain ahead of time, data teams can add to the existing graph structure without endangering current functionality.” We agree that this would be a very important and desired advantage. However, as we describe in this paper, changes in the model of the Property Graph data will require refactoring and changes to queries. In a Property Graph edges and properties are different data structures and their handling in queries is fundamentally different.

As you can see, compared to an RDF Graph, it is harder to organically grow a Property Graph in response to changes in your information requirements.

A current downside of the RDF Statement approach to capturing information about edges is what is sometimes called “graph bloat.” To capture a role that Tom Hanks had in The Post, we need to add at least three extra statements (rdf:subject, rdf:predicate and rdf:object) in addition to the role information, or four if you also add a type link to rdf:Statement. Quite a lot of overhead for just one fact. If, however, you need to capture several facts about Tom’s acting in this movie, then this approach has less overhead.

A new extension to the RDF data model called RDF* (RDF Star) and its variation called RDF Plus address this issue. It is currently in the process of being added to the standard. In the meantime, TopBraid EDG can create a new node with the URI composed from the subject-predicate-object nodes of the statement you need to add information to. The new node uniquely identifies the original statement and can be used as a subject of other statements, avoiding graph bloat. For standard-compliant information exchange, EDG serializes such nodes as RDF Statements.

Graph Analytics, Named Graphs and Other Topics

This white paper is not intended to completely cover all capabilities of Property Graphs or Knowledge Graphs. We have focused only on critical differentiators. With this, we need to at least mention two important topics:

• Algorithms for Graph Analytics
• Named Graphs

Graph analytics is a key application for property graphs. By analytics, we mean node centrality, node similarity, shortest paths, clustering and other algorithms. Property Graphs are known for offering these algorithms and many applications of property graphs rely on such algorithms. Having said this, there isn’t anything special in a property graph data model that makes these algorithms possible. They can be applied equally well over RDF Graphs. In fact, many RDF-based solutions are also offering similar algorithms.
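To illustrate that such algorithms do not depend on the property graph data model, here is a minimal Python sketch of a shortest-path search run directly over RDF-style triples. It treats every triple as an undirected edge between subject and object and uses a plain breadth-first search. The data and the names ex:actor_b and ex:movie_x are hypothetical, and a production system would use the analytics features of its graph platform instead.

```python
from collections import deque


def shortest_path(triples, start, goal):
    """Breadth-first search over triples, treating each one as an undirected edge."""
    neighbours = {}
    for s, _predicate, o in triples:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted only to make the traversal order deterministic.
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


# Hypothetical data: two actors connected through a shared movie.
triples = [
    ("ex:tom_hanks", "ex:actedin", "ex:the_post"),
    ("ex:actor_b", "ex:actedin", "ex:the_post"),
    ("ex:actor_b", "ex:actedin", "ex:movie_x"),
]

print(shortest_path(triples, "ex:tom_hanks", "ex:movie_x"))
```

The same adjacency structure could feed degree centrality or clustering; nothing in the search relies on properties being stored separately from edges.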
Figure 9: RDF Graph with Ben Bradlee as a Person
(Diagram title: "In an RDF Graph, you do not need to change the graph structure to make a link to a resource." The diagram shows a resource labeled "The Post" of type schema:Movie, an rdf:Statement with rdf:subject, rdf:predicate and rdf:object links, the ex:role and ex:actedin properties, the types ex:Actor and ex:Director, and a Wikidata resource with schema:givenName Tom, schema:familyName Hanks and schema:birthDate 1956.)
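The overhead of the rdf:Statement pattern, and the idea of a statement node whose URI is derived from subject, predicate and object, can be sketched in Python as follows. This is an illustration under stated assumptions: the hash-based URI scheme and the urn:example:stmt: prefix are hypothetical and are not a description of how TopBraid EDG actually composes such URIs, and the identifiers ex:tom_hanks, ex:the_post and ex:stmt1 are invented for the example.

```python
import hashlib


def reify(statement_id, triple):
    """Standard RDF reification: four triples that describe one statement."""
    s, p, o = triple
    return {
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    }


def statement_uri(triple, base="urn:example:stmt:"):
    """Derive a deterministic URI from subject, predicate and object.

    Hypothetical scheme for illustration only: a truncated SHA-256 digest.
    """
    digest = hashlib.sha256("\x1f".join(triple).encode("utf-8")).hexdigest()[:16]
    return base + digest


fact = ("ex:tom_hanks", "ex:actedin", "ex:the_post")

# Reified form: four bookkeeping triples plus the role information itself.
reified = reify("ex:stmt1", fact) | {("ex:stmt1", "ex:role", "Ben Bradlee")}

# Compact form: the derived URI already identifies the statement,
# so only the role information needs to be stored.
node = statement_uri(fact)
compact = {(node, "ex:role", "Ben Bradlee")}

print(len(reified), len(compact))
```

Because the URI is a pure function of the three parts of the statement, the reified form can be regenerated on export, which matches the idea of serializing such nodes as RDF Statements for exchange.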
The ability to partition data is important. Relational databases partition data using tables and views. Both Property Graphs and RDF Graphs let users work with sets of nodes of a specific type (in the case of Property Graphs, nodes carrying a specific label), e.g., a query can be limited to only work with actors or to only work with directors. This provides a very basic, limited partitioning.

RDF data can also be partitioned in named graphs. A named graph offers us a way to say that some group of triple statements belongs to a “sub-graph.” We can then give it a uniquely identifying name (hence, the term “named graph”) and associate any other information with it that we see as important. The idea is somewhat similar to views in relational databases. A single statement can belong to many named graphs. Thus, it is a different concept from physically partitioning distinct graphs across different machines.

We can query a named graph individually, or we can query all available graphs, or a subset of available graphs. We can load a named graph, clear it and perform any other manipulations with it. This again follows the idea of “separate, but connectable.”

For example, in TopBraid EDG, a given business glossary or a taxonomy is a named graph. Resources in it can be connected to resources in other graphs, but it can also be manipulated as a distinct set of statements. For example, there could be a purpose associated with a glossary as a whole, e.g., its users and uses can be identified and so on. There is no similar concept in the Property Graph world.

Limitations of Property Graphs

In this white paper, we describe some limitations of Property Graphs and their differences with Knowledge Graphs that are based on RDF.

The main vendor for property graph technology, Neo4J, offers a mature system with some attractive capabilities that are easy to get started with. There are also a few other Property Graph databases on the market today.

However, we increasingly hear of customers hitting the wall with Property Graphs because as they start to use them, they recognize the need for one or more of the following capabilities:

• Capture of Schema in a Graph
• Support for Validation and Data Integrity
• Capture of Rich Rules
• Support for Inheritance and Inference
• Globally Unique Identifiers
• Resolvable Identifiers
• Connectivity Across Graphs
• Better Solution to Graph Evolvability

Note that these are fundamental limitations that are not addressed in the design of property graphs. In principle, it may be possible to add at least some of these capabilities to a Property Graph, but not that easily or elegantly. Some of you may have already started on the road to doing this.

However, it is a lot of effort, both conceptual (i.e., design and architecture) and implementation work. Even if you succeed in
accomplishing it, you will end up with a proprietary home-grown version of capabilities that already exist, are standardized and well proven.

Inherent Semantics make it easy for RDF Graphs to become Knowledge Graphs

As illustrated in the previous sections, RDF-based graphs capture more than just data. They capture the meaning or semantics of data, including rich constraints and highly expressive rules. All information is stored in a graph and is available for query and any other algorithms that can help us reason and discover new knowledge based on the available knowledge. And the amount of the available knowledge with Knowledge Graphs is practically unlimited, just as it is on the world wide web. We can reach out and take advantage of the information available in other graphs. Separate, but connectable is a key feature of the web, and of Knowledge Graphs.

With Property Graphs, data modeling happens on paper or on a white board, separate from the graph itself. Property Graphs are not self-describing and the meaning of the data they store is not a part of a graph.

Some Guidance for Moving from a Property Graph to a Knowledge Graph

It is fairly easy to generate one of the RDF standard serializations from a property graph. In fact, Neo4J offers a library for doing this. You can readily get the data out, but you will not be able to get the semantics of the data; this is due to the fact that the data model only exists in your initial design sketches and, partially, within Cypher queries and programs.

Further, as we have discussed, the structure of the graph data may be influenced by the specific limitations of the property graph data model and optimizations that were required due to the architecture of a property graph database. We already demonstrated how a decision to use intermediate nodes in a property graph may be based on the need to add information to a property, which is only possible if a property is turned into an edge.

Further, in property graphs some property values such as dates or names are often turned into entities because there is no efficient way of querying literal values, especially if they are multivalued. As a result, you may have an entity for a number 58,811 or a year 1956. This, however, could result in having so-called “dense nodes,” or nodes that participate in many relationships. Typically, nodes that are targets of thousands of relationships are considered to be dense in Neo4J, with the potential of performance issues when such nodes are deleted. The design of the model may, therefore, be impacted by the density considerations. Similarly, you may have relationships that represent specific dates, e.g., BORN_IN_1956, BORN_IN_1957, etc. This is a design pattern used in property graphs
because with a generic BORN_IN relationship, Cypher queries looking for people born in, let’s say 1956, do not perform well. Once you move to RDF, you may decide to revisit some of these design decisions.

The simplest way forward is to export property graph data as-is and then create a data model in RDF that represents the structure of the data. For example, if you created intermediate nodes in order to link roles to people portrayed by roles, you would mirror this in your RDF model (often called an ontology) even if strictly speaking this is not necessary in the RDF-based implementation.

TopBraid EDG can use data to reverse engineer an ontology. This will speed up your migration efforts and will make the data model explicit. You can then decide if you want to adjust the model and change the data or move forward with it as-is, evolving it later if necessary.

Many applications today use GraphQL to read and write data. Neo4J and some other Property Graph offerings support GraphQL access to data. If you have used GraphQL to build your solution on top of a Property Graph, you will be able to keep much of your code as you move to an RDF platform like TopBraid EDG that also supports GraphQL.

For property graphs, GraphQL Schemas need to be manually created and then manually maintained as the graph structures get extended and changed. One of the advantages of a self-describing graph is that GraphQL Schemas can be automatically generated from the data model. This delivers on the promise of frictionless development and graceful systems maintenance by rendering unnecessary any manual effort for defining and maintaining schemas. For more information on how TopBraid EDG works with GraphQL, visit topquadrant.com/technology/graphql/.

For the types of queries that can’t be easily supported by GraphQL, you will typically use SPARQL. TopBraid EDG lets you use either of the query languages and it also lets you put SPARQL expressions into GraphQL.

Summary

Neo4J is a mature solution that popularized Property Graphs and made them easy to get started with. People tend to think that RDF based Knowledge Graphs are hard to understand, complex and hard to get started with. In the past, there was some truth to that characterization. Today, with products like TopBraid EDG, it is no longer the case.

Many users are discovering the limitations of property graphs. Even if you started your first graph project using a property graph, it is likely that sooner or later you will be hindered by limitations and will want to adopt or at least explore the feasibility of an RDF / Semantic Knowledge Graph based system. You will not be alone, as a number of organizations are graduating from property graphs to knowledge graphs. We hope that this paper has provided some insight and value in your decision making.
About TopQuadrant

TopQuadrant helps organizations succeed in Data Governance. Its flagship product, TopBraid EDG, delivers easy and meaningful access for all data stakeholders to enterprise metadata, business terms, reference data, data and application catalogs, data lineage, requirements, policies, and processes.

TopQuadrant’s customer list includes over 120 organizations in financial services, pharma, healthcare, digital media, government and other sectors.

Governance Packages Available in TopBraid EDG

• Vocabulary Management
• Metadata Management
• Reference Data Management
• Business Glossaries

In addition to the above, TopBraid Tagger and AutoClassifier is a popular additional module that is part of a comprehensive information management and governance environment where packages for other types of assets can be easily added if needed.

In ramping up a Data Governance program, different organizations may have different starting points. With TopBraid EDG, you can start incrementally and add capabilities as you go. For details on available EDG packages and additional modules visit topquadrant.com/products/topbraid-enterprise-data-governance/

For more details or to schedule a demo, contact us at: edg-info@topquadrant.com

©2020 TopQuadrant, Inc. All rights reserved. TopBraid Enterprise Data Governance–Vocabulary Management, and the TopQuadrant logo are trademarks of TopQuadrant Inc. in the U.S. All other trademarks are the property of their respective owners. Specifications subject to change without notice.