Serving and Querying Open Knowledge Graphs on the Web
KnowGraphs Winter School 2022
Serving and Querying Open Knowledge Graphs on the Web – Part 2
Axel Polleres, referring to joint work with: Javier Fernández, Amr Azzam, Maribel Acosta, Martin Beno, Vadim Savenkov, Katja Hose, Christian Aebeloe, Gabriela Montoya
Thanks to Javier and Amr for some of the slides!
January 2022
What I've planned for today:
§ Part 1:
  § Interlude
  § Practical tutorial on querying Open KGs with SPARQL
  § Challenges/limitations of SPARQL over public endpoints
§ Part 2:
  § Serve and query KGs for local processing – HDT
  § Addressing the SPARQL endpoint bottleneck – where are we?
    § Linked Data Fragments
    § Smart-KG
    § Wise-KG
HDT – a Linked Data hacker toolkit
• Highly compact serialization of RDF
• W3C Member Submission 2011: https://www.w3.org/Submission/HDT/
• Allows fast RDF retrieval in compressed space (without prior decompression)
• Includes internal indexes to solve basic queries with a small memory footprint
• Very fast on basic queries (triple patterns), ~1.5x faster than Virtuoso, Jena, RDF3X
• Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current more optimized solutions

Example (837M triples): NT 122 GB | NT + gzip 9.6 GB | HDT 13 GB
▷ Slightly more than gzip, but you can query!

• Challenges:
  • The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently!)
  • Inefficient for live updates
HDT (Header-Dictionary-Triples) Overview
• Header: RDF metadata describing the RDF dataset
• Dictionary: mapping between IDs and the elements in the dataset
• Triples: structure of the data after the ID replacement
HDT – Header information:

$ hdtInfo wikidata20210305.hdt
"14578569927" .
"38867" .
"1625057179" .
"2538585808" .
"_:format" .
_:format "_:dictionary" .
_:format "_:triples" .
"_:statistics" .
"_:publicationInformation" .
_:publicationInformation "2021-04-24T12:42Z" .
_:dictionary .
_:dictionary "1451915667" .
_:triples .
_:triples "14578569927" .
_:triples "SPO" .
_:statistics "159085366343" .
Dictionary+Triples partition
Hint: imagine HDT for now as "SPO-sorted Turtle"

ex:Vienna foaf:name "Vienna"@en .
ex:Javier ex:workPlace ex:Vienna ;
          foaf:mbox "jfergar@example.org", "jfergar@wu.ac.at" ;
          rdf:type ex:Researcher .
ex:Paul ex:birthPlace ex:Vienna .
ex:Stefan ex:birthPlace ex:Vienna .
Dictionary+Triples partition
(figure: the example graph with every term replaced by its dictionary ID, from 1 ex:Vienna through 13 "Vienna"@en)
Dictionary (in practice)
1 ex:Vienna
2 ex:Javier
3 ex:Paul
4 ex:Researcher
5 ex:Stefan
6 ex:birthPlace
7 ex:workPlace
8 foaf:mbox
9 foaf:name
10 rdf:type
11 "jfergar@example.org"
12 "jfergar@wu.ac.at"
13 "Vienna"@en

i.e., the dictionary is not exactly "SPO-sorted":
• split by role, "SO–S–O–P"-sorted
• prefix-based compression for each role
• efficient ID↔string operations
Dictionary compression: Plain Front Coding (PFC)
PFC relies on prefix-based compression. Each string is encoded with two values:
• an integer representing the number of characters shared with the previous string
• a sequence of characters representing the suffix that is not shared with the previous string

Example: A, An, Ant, Antivirus, Antivirus Software, Best
→ (0,A) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
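The encoding above can be sketched in a few lines of Python. This is only an illustration of the PFC idea, not the actual HDT implementation (which additionally works in fixed-size buckets, each starting with a full string, to allow random access):

```python
# Plain Front Coding sketch: each string is stored as
# (length of prefix shared with the previous string, remaining suffix).

def pfc_encode(sorted_strings):
    encoded, prev = [], ""
    for s in sorted_strings:
        # count the characters shared with the previous string
        shared = 0
        while shared < min(len(prev), len(s)) and prev[shared] == s[shared]:
            shared += 1
        encoded.append((shared, s[shared:]))
        prev = s
    return encoded

def pfc_decode(encoded):
    strings, prev = [], ""
    for shared, suffix in encoded:
        s = prev[:shared] + suffix
        strings.append(s)
        prev = s
    return strings

words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
enc = pfc_encode(words)
print(enc)
# → [(0, 'A'), (1, 'n'), (2, 't'), (3, 'ivirus'), (9, ' Software'), (0, 'Best')]
assert pfc_decode(enc) == words
```

Note that decoding only ever needs the previous string, which is why lexicographic sorting of the dictionary pays off twice: long shared prefixes and cheap sequential decoding.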
Bitmap Triples Encoding
• Bitmap Triples, idea a bit similar to "sorted Turtle":

ex:Vienna foaf:name "Vienna"@en .             →  1 4 5 .
ex:Javier ex:workPlace ex:Vienna ;            →  2 2 1 ;
          foaf:mbox "jfergar@example.org",    →      3 3 ,
                    "jfergar@wu.ac.at" ;      →          4 ;
          rdf:type ex:Researcher .            →      5 2 .
ex:Paul ex:birthPlace ex:Vienna .             →  3 1 1 .
ex:Stefan ex:birthPlace ex:Vienna .           →  4 1 1 .

(figure: the corresponding bitsequences and ID sequences for subjects, predicates and objects)
We index the bitsequences to provide an SPO index.
Bitmap Triples Encoding
• Bitmap Triples support the patterns SPO, SP?, S?O, S??, and ???.
E.g., retrieve (2, 5, ?):
1. Find the position of the first and second '1'-bits in Bp (select) — this delimits the predicate list of subject 2.
2. Binary search on the list of predicates Sp in this range, looking for 5.
3. Note that predicate 5 is in position 4 of Sp.
4. Find the position of the fourth '1'-bit in Bo (select) → 5, i.e. retrieve the 5th value of So → 2.
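The lookup steps above can be sketched in Python. The bitsequences and ID sequences below encode the running example (1-bits mark the end of each subject's predicate list in Bp, and of each (subject, predicate) pair's object list in Bo); a real HDT answers rank/select in constant time over succinct bitsequences rather than scanning:

```python
# Toy bitmap-triples lookup for patterns like (2, 5, ?).

def select1(bits, k):
    """Position (1-indexed) of the k-th 1-bit."""
    count = 0
    for i, b in enumerate(bits, start=1):
        count += b
        if count == k:
            return i
    raise ValueError("fewer than k 1-bits")

Bp = [1, 0, 0, 1, 1, 1]        # 1 marks the last predicate of each subject
Sp = [4, 2, 3, 5, 1, 1]        # predicate IDs, grouped by subject
Bo = [1, 1, 0, 1, 1, 1, 1]     # 1 marks the last object of each (s, p) pair
So = [5, 1, 3, 4, 2, 1, 1]     # object IDs, grouped by (s, p) pair

def lookup(s, p):
    # 1. predicate range of subject s: (select(Bp, s-1), select(Bp, s)]
    lo = select1(Bp, s - 1) if s > 1 else 0
    hi = select1(Bp, s)
    # 2. search for p in Sp[lo..hi-1] (linear scan here for brevity;
    #    HDT uses binary search since each list is sorted)
    pos = next(i for i in range(lo + 1, hi + 1) if Sp[i - 1] == p)
    # 3. object range of the pos-th (s, p) pair in Bo/So
    olo = select1(Bo, pos - 1) if pos > 1 else 0
    ohi = select1(Bo, pos)
    return So[olo:ohi]

print(lookup(2, 5))   # → [2], i.e. ex:Javier rdf:type ex:Researcher
print(lookup(2, 3))   # → [3, 4], the two foaf:mbox values
```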
On-the-fly indexes: HDT-FoQ (Focus-on-Querying indexes)
• From the exchanged HDT to the functional HDT-FoQ:
• Publish and exchange HDT (i.e., Bp, Sp, Bo, So from the last slide), and at the consumer:
  1. Subject index (SPO): index the bitsequences — patterns SPO, SP?, S??, S?O, ???
  2. Predicate index (PSO): index the position of each predicate (just a position list) — patterns ?P?, ?PO
  3. Object index (OPS): index the position of each object — pattern ??O
• The extra indexes go into a separate index file, created by the consumer client (or published as well)

Martínez-Prieto, M., M. Arias, and J. Fernández (2012). Exchange and Consumption of Huge RDF Data. In Proc. of the 9th Extended Semantic Web Conference (ESWC), pp. 437-452.
rdfhdt.org community
https://github.com/rdfhdt — C++ and Java tools (HDT-cpp, + Python!): https://github.com/Callidon/pyHDT/
Results
• Data is ready to be consumed 10-15x faster than loading into an RDF triple store
• HDT
Hands-on example: trying out the query that timed out on https://w.wiki/4mTj

$ ls
wikidata20210305.hdt  wikidata20210305.hdt.index.v1-1
$ time hdtsparql.sh wikidata20210305.hdt "$(
Challenge 3: Scalability of SPARQL endpoints? – It's often too expensive to host Open KGs
• Challenge 3.1: serve complex/long-running queries to single users
• Challenge 3.2: serve many queries to many users concurrently
Querying KGs with SPARQL "abstractly": What happens if you have many clients?
• KG = RDF Graph
  ○ a set of RDF triples (subject, predicate, object)
  ○ can be interpreted as an edge-labeled directed graph
• SPARQL
  ○ standard query language for RDF (W3C Recommendation)
  ○ subgraph pattern matching → returns variable bindings (a table)
Server Solution: SPARQL Endpoint ("query shipping")
The client ships the whole query over the network to the server's SPARQL endpoint:

SELECT ?v1 ?v2 ?v3
WHERE { ?v1 owl:starring "Brad Pitt" .
        ?v1 rdfs:label ?v2 .
        ?v1 dbo:director ?v3 . }
Server Solution: SPARQL Endpoint
(figure: clients C1–C8 all querying the same SPARQL endpoint)
"Query shipping" fails under concurrency.
Alternative Client Solution: Data Dump
(figure: clients C1–C8 each downloading the full data dump instead of querying the endpoint)
Data shipping might add prohibitive load on the network.
Alternative Client Solution: Data Dump – using HDT
(figure: clients C1–C8 each downloading the dataset as a compact HDT file)
Data shipping using HDT might still add prohibitive load on the network.
Variants in between: What are the current mitigations to the availability problem of open RDF KGs?
The Linked Data Fragments (LDF) framework was proposed to design new mixes of trade-offs ("hybrid shipping") on a spectrum between:
• Data Dump — high availability, high client cost, low server cost
• SPARQL Endpoint — low availability, low client cost, high server cost
with Triple Pattern Fragments (TPF, ISWC 2014) in between.
Triple Pattern Fragments (TPF): Idea:
• execute triple patterns on the server
• let the clients do joins etc. by themselves
→ less footprint on the server: only triple patterns and intermediate results are communicated
→ but can still incur significant overhead from large intermediate results

R. Verborgh, M. Vander Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester, G. Haesendonck, and P. Colpaert. 2016. Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Semant. 37-38 (2016), 184–206.
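The TPF division of labor can be sketched as follows. The "server" below is a local function that only answers single triple patterns, and the client performs a nested-loop join itself; the graph and all names are made up for illustration (a real TPF server is an HTTP interface with paging):

```python
# TPF sketch: server answers triple patterns only; client does the join.

GRAPH = [
    ("film1", "starring", "BradPitt"),
    ("film2", "starring", "BradPitt"),
    ("film1", "director", "DavidFincher"),
]

def tpf_server(s, p, o):
    """A triple-pattern request; None plays the role of a variable."""
    return [t for t in GRAPH
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Client-side nested-loop join for { ?f starring BradPitt . ?f director ?d }:
results = []
for f, _, _ in tpf_server(None, "starring", "BradPitt"):  # request 1
    for _, _, d in tpf_server(f, "director", None):       # one request per binding
        results.append((f, d))

print(results)  # → [('film1', 'DavidFincher')]
```

Note the cost profile: one request for the first pattern plus one request per intermediate binding, which is exactly where the overhead of large intermediate results shows up.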
Bindings-Restricted Triple Pattern Fragments (brTPF): Idea:
• ship intermediate bindings along with the triple pattern and let the server return only matching results
→ smaller intermediate results; the "join work" is distributed between client and server

O. Hartig and C. Buil-Aranda. 2016. Bindings-Restricted Triple Pattern Fragments. In ODBASE 2016, 762–779.
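Continuing the toy setup, a brTPF-style exchange ships the current bindings with the pattern so the join above needs a single follow-up request instead of one per binding (again, all names and the in-memory "server" are illustrative):

```python
# brTPF sketch: the client ships bindings together with the triple pattern.

GRAPH = [
    ("film1", "starring", "BradPitt"),
    ("film2", "starring", "BradPitt"),
    ("film1", "director", "DavidFincher"),
]

def tpf_server(s, p, o):
    return [t for t in GRAPH
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def brtpf_server(p, o, s_bindings):
    """Like a TPF request, but restricted to the shipped subject bindings."""
    return [t for t in GRAPH
            if t[1] == p and t[0] in s_bindings
            and (o is None or t[2] == o)]

# { ?f starring BradPitt . ?f director ?d } with one brTPF follow-up request:
films = [s for s, _, _ in tpf_server(None, "starring", "BradPitt")]
results = [(s, o) for s, _, o in brtpf_server("director", None, films)]
print(results)  # → [('film1', 'DavidFincher')]
```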
Variants in between: What are the current mitigations to the availability problem of open RDF KGs?
The Linked Data Fragments (LDF) framework was proposed to design new mixes of trade-offs ("hybrid shipping"). SaGe (WebConf 2019) adds another point on the spectrum between TPF (ISWC 2014) and the SPARQL endpoint: moderate availability, low client cost, high server cost.
Slightly different approach: SaGe
Idea: keep working on the server, but improve fair allocation of resources, i.e. interrupt resource-intensive queries:
• BGP, UNION, and FILTER are executed fully at the server as "interruptible iterators" …
• … which can be stopped and sent back to clients with the results "so far" …
• … while giving clients the possibility to resume them later on
• the server suspends a running query after a fixed quantum of time and resumes the next waiting query
• more complex operations are done on the client (e.g. OPTIONAL, SERVICE, ORDER BY, GROUP BY, DISTINCT, MINUS, FILTER EXISTS, and aggregations)
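The suspend/resume loop can be sketched as follows. Here the saved query state is simply a scan offset and the quantum is counted in processed items rather than milliseconds, for determinism; the real SaGe serializes the state of a full physical plan and measures wall-clock time:

```python
# Web-preemption sketch: the server runs a query for a fixed quantum, then
# suspends it and returns the saved state to the client, which resumes later.

DATA = list(range(10))  # stand-in for the results of a server-side plan

def serve(state, quantum=4):
    """Run the scan for up to `quantum` steps; return (results, next state)."""
    offset = state or 0
    results = DATA[offset:offset + quantum]
    next_offset = offset + len(results)
    # None signals that the query is complete
    return results, (next_offset if next_offset < len(DATA) else None)

# The client keeps resuming the suspended query until it is complete:
state, collected = None, []
while True:
    results, state = serve(state)
    collected.extend(results)
    if state is None:
        break

print(collected)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Between two resumptions of this query the server is free to serve other waiting clients, which is the fairness property SaGe is after.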
Variants in between: What are the current mitigations to the availability problem of open RDF KGs?
smart-KG (WebConf 2020) adds yet another mix of trade-offs to the LDF spectrum (Data Dump – TPF – SaGe – SPARQL Endpoint): "partition shipping".
Smart-KG Idea:
• partition the graph by "predicate families", i.e. characteristic sets
• create one HDT per family
• combine TPF with partition shipping
(figure: client C1 downloads KG partitions (separate HDTs) from the Smart-KG server, which acts as TPF + partitions server)
smart-KG server: Family Generator
Partition Generator (PG): upon loading a graph G, decompose it into partitions G1, …, Gm, one per "predicate family".
(figure: Brad, Angelina, and Emma with their name, birthPlace, almaMater, spouse, and worksIn properties)
F1: {name, birthPlace, almaMater, spouse}
F2: {name, spouse, worksIn}
F3: {name, birthPlace, almaMater, worksIn}
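Computing these predicate families is straightforward: group subjects by the exact set of predicates they use (their characteristic set). The toy triples below are modeled after the slide's figure; each resulting family would become one partition (one HDT file) in smart-KG:

```python
# Characteristic-set partitioning sketch.
from collections import defaultdict

TRIPLES = [
    ("Brad", "name", "Brad Pitt"),
    ("Brad", "birthPlace", "Oklahoma"),
    ("Brad", "almaMater", "University of Missouri"),
    ("Brad", "spouse", "Angelina"),
    ("Angelina", "name", "Angelina Jolie"),
    ("Angelina", "spouse", "Brad"),
    ("Angelina", "worksIn", "Los Angeles"),
    ("Emma", "name", "Emma Watson"),
    ("Emma", "birthPlace", "Paris"),
    ("Emma", "almaMater", "Oxford University"),
    ("Emma", "worksIn", "Los Angeles"),
]

# 1. collect the predicate set (characteristic set) of each subject
preds = defaultdict(set)
for s, p, o in TRIPLES:
    preds[s].add(p)

# 2. group subjects by their characteristic set -> one family per set
families = defaultdict(list)
for s, ps in preds.items():
    families[frozenset(ps)].append(s)

for fam, subjects in families.items():
    print(sorted(fam), "->", subjects)
# three families, matching F1 (Brad), F2 (Angelina), F3 (Emma) on the slide
```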
smart-KG server: Family Generator
Partition Generator (PG): upon loading a graph G, decompose it into partitions G1, …, Gm, one per "predicate family" … and convert these to HDTs (Family1.hdt, Family2.hdt, Family3.hdt).
Smart-KG: Simplified client
Query processing:
1. the client decomposes BGPs into "stars"
2. retrieves relevant information from the server to make a query plan
3. retrieves and joins matching partitions one by one (using TPF for single-triple-pattern stars)
Example stars: F?: {starring, name} and F?: {wikiPageExtLink, birthPlace, gender}
(figure: client C1 combining KG partitions (separate HDTs) with TPF requests)
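Steps 1 and 2 can be sketched as follows: split a BGP into star patterns (triple patterns sharing a subject variable), then match each star against the server's catalog of families. The BGP, family IDs, and predicate names are illustrative, and the relevance test is simplified to subset containment:

```python
# Star decomposition + family matching sketch for a smart-KG client.
from collections import defaultdict

BGP = [
    ("?film", "starring", "?actor"),
    ("?film", "name", "?title"),
    ("?actor", "birthPlace", "?place"),
    ("?actor", "gender", "?g"),
]

FAMILIES = {                     # family id -> predicate set (server catalog)
    "F1": {"starring", "name"},
    "F2": {"wikiPageExtLink", "birthPlace", "gender"},
}

# 1. group triple patterns by subject -> star patterns
stars = defaultdict(set)
for s, p, o in BGP:
    stars[s].add(p)

# 2. a family is relevant for a star if it covers all the star's predicates
relevant = {subj: [f for f, ps in FAMILIES.items() if star <= ps]
            for subj, star in stars.items()}

print(relevant)  # → {'?film': ['F1'], '?actor': ['F2']}
```

The client would then download only Family1.hdt and Family2.hdt and join their matches locally, star by star.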
Smart-KG: Further details, cf. our paper:
• predicate-restricted families, i.e. pruning predicates that are too rare or too common for partitioning. Example: for DBpedia, a naive partitioning would create 600k+ partially very large families, which are infeasible to serve.
• partition caching
Experiments:
• server with 384 GB RAM; up to 80 clients with 32 GB RAM each; 1 GBit/s network (limited to 20 Mbit/s per client)
• WatDiv up to 1B triples, queries with up to 10 joins
• DBpedia, 12 random BGPs from LSQ
Overall Query Performance – Increasing Number of Clients (WatDiv 100M)
Overall Query Performance – Increasing KG Size
Resource Consumption on the Server – Network, CPU & RAM
Next Step/Extension: WiseKG (WebConf 2021)
• execute star patterns directly on the server (resources allowing) …
• … using an extension of TPF called SPF … or on the client using SmartKG …
• … deciding based on comparing COST models for client and server execution, taking the current server load into account:
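Such a cost-based routing decision could be sketched as below. The cost formulas and all numbers are invented for illustration; the actual WiseKG cost models are calibrated from measured system parameters:

```python
# WiseKG-style routing sketch: per star pattern, compare an estimated client
# cost (download + local join over a smart-KG partition) with an estimated
# server cost (SPF evaluation under the current load).

def client_cost(partition_mb, bandwidth_mbps, cpu_factor=0.01):
    transfer = partition_mb * 8 / bandwidth_mbps   # seconds to ship partition
    return transfer + cpu_factor * partition_mb    # plus local processing

def server_cost(est_results, server_load):
    # processing time grows with the current concurrent load on the server
    return est_results * 1e-4 * (1 + server_load)

def route(partition_mb, bandwidth_mbps, est_results, server_load):
    c = client_cost(partition_mb, bandwidth_mbps)
    s = server_cost(est_results, server_load)
    return "client (smart-KG partition)" if c < s else "server (SPF)"

# cheap server-side evaluation, idle server -> evaluate on the server
print(route(partition_mb=200, bandwidth_mbps=20, est_results=5000, server_load=0.1))
# expensive server-side evaluation under load -> ship the partition instead
print(route(partition_mb=200, bandwidth_mbps=20, est_results=2_000_000, server_load=1.0))
```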
Challenge 3: Scalability of SPARQL endpoints? – It's often too expensive to host Open KGs
• Challenge 3.1: serve complex/long-running queries to single users
• Challenge 3.2: serve many queries to many users concurrently
Summary: HDT helps :-) More ideas for improvements, e.g.:
• peer-to-peer?
• optimize partitioning (based on query logs?)
TODO: I owe you a full list of references, will be added shortly! :-)