Enabling the Geospatial Semantic Web with Parliament and GeoSPARQL
Semantic Web 0 (0) 1
IOS Press

Editor(s): Krzysztof Janowicz, University of California, Santa Barbara, US
Solicited review(s): Sven Schade, European Commission – Joint Research Centre; Jens Lehmann, University of Leipzig, Germany; Boyan Brodaric, Geological Survey of Canada, Canada

Robert Battle, Dave Kolas
Knowledge Engineering Group, Raytheon BBN Technologies, 1300 N 17th Street, Suite 400, Arlington, VA 22209, USA
E-mail: {rbattle,dkolas}@bbn.com

Abstract. As the amount of Linked Open Data on the web increases, so does the amount of data with an inherent spatial context. Without spatial reasoning, however, the value of this spatial context is limited. Over the past decade there have been several vocabularies and query languages that attempt to exploit this knowledge and enable spatial reasoning. These attempts provide varying levels of support for fundamental geospatial concepts. GeoSPARQL, a forthcoming OGC standard, attempts to unify data access for the geospatial Semantic Web. As authors of the Parliament triple store and contributors to the GeoSPARQL specification, we are particularly interested in the issues of geospatial data access and indexing. In this paper, we look at the overall state of geospatial data in the Semantic Web, with a focus on GeoSPARQL. We first describe the motivation for GeoSPARQL, then the current state of the art in industry and research, followed by an example use case, and finally our implementation of GeoSPARQL in the Parliament triple store.

Keywords: GeoSPARQL, SPARQL, RDF, Parliament, geospatial, geospatial data, query language, geospatial query language, geospatial index

1570-0844/0-1900/$27.50 © 0 – IOS Press and the authors. All rights reserved

1. Introduction

Geospatial data is increasingly being made available on the Web in the form of datasets described using the Resource Description Framework (RDF). The principles of Linked Open Data, detailed in [5], encourage a set of best practices for publishing and connecting structured data on the Web. Linked Open Data promotes the use of the SPARQL Protocol and RDF Query Language (SPARQL) and RDF to query and model data. While this is useful for querying for relationships that are explicitly represented in data, implicit relationships, such as geospatial relationships, cannot easily be queried. For instance, datasets may exist that describe monuments and parks, but being able to link these datasets based on their undeclared relationships is difficult. The ability to answer a meaningful query, like "What parks are within 3km of the Washington Monument?", depends on how the data is represented, whether all of the resources are related to the Washington Monument, and whether that relationship is explicit.

In this paper, we discuss an emerging standard, GeoSPARQL [24], from the Open Geospatial Consortium (OGC)¹. This standard aims to address the issues of geospatial data representation and access. It provides a common representation of geospatial data described using RDF, and the ability to query and filter on the relationships between geospatial entities. First, we introduce some geospatial concepts that are critical to understanding some of the design choices for GeoSPARQL. We then describe topological relationships that are important to understand when designing a language for querying between spatial entities. Next, we describe the motivation for GeoSPARQL, the current state of the art in modeling and querying geospatial data in the Semantic Web, and we introduce GeoSPARQL. As authors of the Parliament² triple store, we are particularly interested in the capabilities that GeoSPARQL provides. We describe an implementation of a GeoSPARQL spatial index using Parliament and a use case that illustrates a simple example of data integration with GeoSPARQL.

¹ Dave Kolas is a co-chair for the GeoSPARQL Standards Working Group, and Robert Battle has worked on the Parliament implementation and provided feedback to the development of the standard.
² http://parliament.semwebcentral.org

2. Geospatial Concepts

Some basic understanding of geospatial concepts is required for discussion of GeoSPARQL. The following sections define some of the terms used throughout the rest of this paper.

2.1. Features and Geometries

Features and geometries are two fundamental concepts of geospatial science. A feature is simply any entity in the real world with some spatial location. This could be a park, an airport, a monument, a restaurant, etc. A feature can have a spatial location that cannot be precisely defined, such as a swamp or a mountain range. A geometry is any geometric shape, such as a point, polygon, or line, and is used as a representation of a feature's spatial location. For instance, Reagan National Airport is a geospatial feature because it is an entity that has a specific location in the world. It has a geometry that is a point with coordinates 38.852222, -77.037778 (in the WGS84³ datum). Geometries can be measured at varying resolutions, from a simple point in the center of a feature to a complex, precise measurement of a feature's entire border. Spatial data typically separates features from geometries, although that is not always the case.

³ http://en.wikipedia.org/wiki/World_Geodetic_System

2.2. Coordinate Reference Systems

An important part of the metadata associated with a geometry is its coordinate reference system (CRS) (alternatively known as its spatial reference system). The elements of a coordinate reference system provide context for the coordinates that define a geometry in order to accurately describe their position and establish relationships between sets of coordinates. There are four parts that make up a CRS: a coordinate system, an ellipsoid, a datum, and a projection.

A coordinate system describes a location relative to some center. A geocentric coordinate system places the center at the center of the Earth and uses standard X, Y, Z ordinates. A geographic (or geodetic) coordinate system uses a spherical surface to determine locations. In such a system, a point is defined by angles measured from the center of the Earth to a point on the surface. These are also known as latitudes (horizontal) and longitudes (vertical). A Cartesian coordinate system is a flat coordinate system on the surface. It enables quick and accurate measurements over small distances and is useful for applications such as surveying.

An ellipsoid defines an approximation for the center and shape of the Earth. A datum defines the position of an ellipsoid relative to the center of the Earth. This provides a frame of reference for measuring locations and, for local datums, allows for accurate locations to be defined for the valid area of the datum. WGS84 is a datum that is widely used by GPS devices and that approximates the entire world. Geographic coordinate systems use an Earth-based datum that transforms an ellipsoid into a representation of the Earth.

In order to create a map of the Earth, it must be projected from a curved surface onto the plane. This projection will distort the surface in some fashion, which means that the coordinates for some locations are more accurate than those for others. Some projections will preserve area, so the size of all objects is relative, while others preserve angles, and others try to do both. A coordinate system projected onto a plane enables faster performance, as Cartesian calculations require fewer resources than spherical calculations. Computations across the plane, however, are inaccurate when they deal with large areas, as the curvature of the Earth is not taken into account.

The combination of these elements defines a CRS. One common source of well-defined coordinate reference systems is the European Petroleum Survey Group (EPSG)⁴.

⁴ http://www.epsg-registry.org

2.3. Topological Relationships

All spatial entities are inherently related to some other spatial entity. Whether two entities intersect somehow or are thousands of miles apart, the relationship that they share can be described and evaluated.

In [28], Randell et al. describe an interval logic for reasoning about space using a simple ontology that defines functions and relations for expressing and reasoning over spatial regions. This logic is referred to as Region Connection Calculus (RCC). A subset of RCC, RCC8, defines eight jointly exhaustive and pairwise disjoint relations which can be used to imply the rest of the relations in RCC. These eight base relations are:

1. DC(x,y) (x is disconnected from y)
2. x = y (x is identical with y)
3. PO(x,y) (x partially overlaps y)
4. EC(x,y) (x is externally connected with y)
5. TPP(x,y) (x is a tangential proper part of y)
6. NTPP(x,y) (x is a non-tangential proper part of y)
7. TPPi(x,y) (y is a tangential proper part of x)
8. NTPPi(x,y) (y is a non-tangential proper part of x)

The same set of eight geospatial topological relations is described with different names by Egenhofer in [8], and includes the capacity to describe the relationship between different dimensioned geometries. This model was later generalized in the Nine Intersection Model [11]. The Open Geospatial Consortium Simple Feature Access Common Architecture specification [25] uses the Nine Intersection Model introduced by Egenhofer to describe spatial relations for use in geographic access systems. Table 1 illustrates the equivalence between all of these spatial relations.

Table 1
Simple Features, Egenhofer and RCC8 relations equivalence

Simple Features    Egenhofer             RCC8
equals             equal                 EQ
disjoint           disjoint              DC
intersects         ¬disjoint             ¬DC
touches            meet                  EC
within             inside + coveredBy    NTPP + TPP
contains           contains + covers     NTPPi + TPPi
overlaps           overlap               PO

3. Motivation for GeoSPARQL

The Open Geospatial Consortium is a non-profit standards organization focused on geospatial data. The OGC is composed of members of industry, academic institutions, and government organizations. By standardizing GeoSPARQL within the OGC, we seek to leverage the experience of its members to ensure that geospatial Semantic Web data is represented in a consistent logical way. With the input and acceptance of both knowledge base vendors and data producers and consumers, GeoSPARQL has the potential to unify geospatial RDF data access.

Geospatial reasoning is critical in a large number of application domains (emergency response, transportation planning, hydrology, land use, etc.). Users in these domains have long utilized relational databases with spatial extensions [9]. These spatially extended databases combine efficient, stable storage and retrieval of data with geospatial calculation and indexing. This allows questions like "Which students live within 2km of the school they attend?" to be answered efficiently.

Within the last decade, RDF storage solutions have become increasingly popular. These knowledge bases, sometimes called triple stores, are capable of better handling several types of problems at which relational databases struggle or are not intended to perform: queries with many joins across entities [32], queries with variable properties [32], and ontological inference on datasets. These features lend themselves towards problems that involve data exploration, linkage across datasets, and abstraction from low level data.

Because of RDF stores' ability to do inference and easily link data sets, they have been of growing interest to the geospatial data community. Often geospatial domains have complicated type hierarchies which cannot be fully expressed in current geospatial information systems. For instance, a river is both a waterway and a transportation route. Also, geospatial domain problems often require marrying multiple data sources together to solve a particular problem. In emergency response scenarios, population data, transportation data, and even realtime police and fire data must be combined to deliver a timely result. Combining data sources on the web is useful to consumers as well; geospatial data about points of interest combined with hotel information and travel route information could
lead to significantly more sophisticated travel planning than currently exists.

As such, it was inevitable that those people interested in expressing geospatial data on the Semantic Web would want to combine the spatial indexing and calculation of spatial databases with the inferential power and data linkage of RDF triple stores [16]. This has been done by many groups in varying ways for varying purposes, as will be discussed later.

To provide geospatial reasoning and querying in a triple store, the implementors must define both an ontology for representing spatial objects and query predicates for retrieving these spatial objects. However, each organization that has attempted this task has approached it in a slightly different way. As a result, spatial RDF data that would be properly indexed and queryable in one implementation would simply be treated as plain RDF data in another implementation.

This is the primary problem which the standardization of GeoSPARQL attempts to solve. The GeoSPARQL language defines both a small ontology for representing features and geometries and a number of SPARQL query predicates and functions. All of these are derived from other OGC standards so that they are well grounded and understood. Using the new standard should ensure two things: (1) if a data provider uses the spatial ontology in combination with an ontology of their domain, that data can be properly indexed and queried in spatial RDF stores; and (2) compliant RDF triple stores should be able to properly process the majority of spatial RDF data.

Aside from providing the ability to perform spatial queries, the small ontology portion of the GeoSPARQL specification is intended to provide an interchange format for geospatial data in a wide variety of use cases. The ontology is meant to be attached to other ontologies for various domains, providing only the bare spatial aspects. It is intended to be simple enough to cover the most light-weight uses, and to scale in complexity for complex use cases. GeoSPARQL goes a long way to solving one of the research issues posed by Egenhofer in [10], namely that "we need a plausible canonical form in which to pose geospatial data queries".

GeoSPARQL is intended to interoperate with both quantitative and qualitative spatial reasoning systems. A quantitative spatial reasoning system involves concrete geometries for features. With these concrete geometries present, distances and topological relations can be explicitly calculated. Qualitative geospatial reasoning systems allow RCC type topological inferences for features where the geometries are either unknown or cannot be made concrete [13]. For example, if there are assertions that a monument is inside a park, and the park is inside a city, a qualitative reasoning system should be able to infer that the monument is in the city through transitivity. Hybrid versions of these systems are also possible, where some features have concrete geometries and others have abstract geometries. By sharing a set of terms for topological relations, GeoSPARQL allows conclusions from quantitative applications to be used by qualitative systems, and provides a single query language for both types of reasoning.

4. Geospatial RDF: State of the Art and Related Work

Over the past decade, there have been many different attempts to create a geospatial RDF standard. Several different organizations, including the W3C, research groups, and triple store vendors, have created their own ontologies and strategies for representing and querying geospatial data.

In 2003, a W3C Semantic Web Interest Group created the Basic Geo Vocabulary [31], which provided a way to represent WGS84 points in RDF. This work was extended from 2005 through 2007 by the W3C Geospatial Incubator Group [21] to follow the GeoRSS [23] feature model to allow for the description of points, lines, rectangles, and polygon geometries and their associated features. This group produced the GeoOWL ontology⁵, which provides a detailed and flexible model for representing geospatial concepts. These ontologies were produced as products from their respective groups within the W3C as the result of collaborations across university and industry partners.

⁵ http://www.w3.org/2005/Incubator/geo/XGR-geo-20071023/W3C_XGR_Geo_files/geo_2007.owl

There are published datasets that use the W3C vocabularies [3] and a variety of triple stores that support data represented by these ontologies [17,22]. However, the associated working groups never moved beyond the incubator state and the respective ontologies never became official W3C recommendations. These ontologies also only work with data in the WGS84 datum. In order to be valid, data in any other CRS must be projected, which can lead to inaccuracy in the data.

Support for spatial data in triple stores is mixed. Several vendors support spatial data, but not all vendors support the same representation of data or share the same support for relational queries. Some triple stores use the aforementioned W3C ontologies, while others have invented their own.

Parliament, the high performance [18] triple store from Raytheon BBN Technologies, provided the first geospatial index for semantic web data in 2007 [16,17]. This index supports data in the GeoOWL ontology and introduces ontologies for querying spatial data via RDF properties that correspond to the RCC spatial relations and OGC Simple Features relations. However, data in the W3C Basic Geo vocabulary is unsupported.

Ontotext's OWLIM-SE⁶ triple store can index point data represented in the W3C Basic Geo Vocabulary [22]. The only spatial relationship that can be queried is whether a point is contained within a circle or polygon. The ability to query for relationships between higher order geometries such as lines and polygons is not supported.

OpenLink Virtuoso⁷ also has support for the W3C Basic Geo Vocabulary. A SPARQL function is provided to convert a pair of latitude and longitude property values into a point geometry. A special literal datatype, virtrdf:Geometry, is also provided for indexing point literals. Support for testing intersection and containment relationships is provided via property functions. Through a combination of these relations and their negations, most of the Egenhofer relations can be tested. However, some relations, such as testing for overlap, cannot be supported in Virtuoso.

Other triple stores take a completely different approach. As described in [12], Franz AllegroGraph⁸ defines a custom datatype to represent geospatial data and map it to a "strip" of space in the index that contains the data. A modified SPARQL syntax provides a new GEO operator where the geospatial aspect of the query is defined.

OpenSahara⁹ provides a service for adding external indexing and geospatial querying capabilities to any triple store with a Sesame¹⁰ Sail layer. The implementation is a wrapper for the PostgreSQL¹¹ database with PostGIS spatial extensions¹² and as such, OpenSahara supports all of the geometries and relations defined in the OGC Simple Features Access [25]. Literal datatypes are introduced for Well-Known Text (WKT), Well-Known Binary (WKB), and their compressed forms, while the spatial relations that PostGIS supports are implemented as SPARQL filter functions.

⁶ http://www.ontotext.com/owlim/editions
⁷ http://www.openlinksw.com
⁸ http://www.franz.com/agraph/allegrograph
⁹ http://www.opensahara.com
¹⁰ http://www.openrdf.org
¹¹ http://www.postgresql.org
¹² http://postgis.refractions.net

In addition to vendor supported options, there has been significant community and research interest in representing and querying geospatial data in the last few years. Perry proposed an extension to SPARQL, SPARQL-ST, in [26]. This introduces a modified SPARQL syntax for posing spatial queries to data that is modeled in an upper ontology based on GeoRSS. A focus on describing data and metadata such as the CRS for geometries allows SPARQL-ST to operate with data of different systems, which is something that is lacking from many vocabularies such as GeoOWL and the Basic Geo vocabulary. This increased flexibility with data is hindered, however, by the proposed query syntax, which deviates from the standard SPARQL language. Any data that is in the SPARQL-ST format can only be accessed by a SPARQL-ST system.

Taking a simpler approach, Zhai et al., in [34] and [33], discuss the need for adding topological predicates to SPARQL. The OGC Simple Features relations and a subset of geometries are used as the basis for their ontologies. Unfortunately, the relations have to be specifically encoded in RDF and the data cannot support multiple coordinate reference systems.

Another approach described by Koubarakis and Kyzirakos in [19] is based on research from the constraint database community. Constraint databases are a promising technology for integrating spatial data [4]. The authors propose to enrich the Semantic Web with spatial and temporal data by extending RDF and SPARQL with constraints. The proposed extension to RDF, stRDF, uses typed literals to describe a semi-linear point set. The query language, stSPARQL, extends SPARQL to include additional operators for querying RCC relationships and introduces a new syntax for specifying spatial variables. Support for multiple coordinate reference systems is not discussed. Compatibility with existing data is also a problem, as not all data is represented as a semi-linear point set. stSPARQL is intentionally not compatible with OGC standards, as the authors believe that the semi-linear point set can be used to describe different geometries without forcing a hierarchy of datatypes on users. In [20], stSPARQL and stRDF are described with an additional context of GIS applications. The authors contrast stRDF and stSPARQL with prior work (such as GeoOWL) and note that it is hard to find information
about geometries when you do not know what type of geometries will be in the answer set and what properties to ask about. This is a problem with representations like GeoOWL where different geometries have different properties. The semi-linear point set avoids this by encapsulating the representation into a single datatype.

The NeoGeo Vocabulary [29] is a vocabulary that arose from the NeoGeo community¹³, the Linked Data community, and several universities. They recognized the need for a well formed standard representation for geospatial data and provided one that is based on the Geography Markup Language (GML) Simple Features Profile¹⁴. To describe spatial relations, an ontology based on RCC8 is also provided. A limitation of the NeoGeo approach is the need to represent each coordinate as a resource. In particular, polygons and lines are represented with an RDF collection of Basic Geo points. While this does allow points to be shared across geometries, it significantly increases the verbosity of the data. Unfortunately, this extra verbosity does not result in particular gains in expressive power, since each point in a polygon provides little value in isolation and RDF list contents are difficult to query for in SPARQL 1.0. Moreover, this prevents geometry literals from being compared easily in SPARQL filter functions. The ability to use content negotiation to retrieve other formats for geometries mitigates these issues for some applications. An advantage of this approach is that it does not require any complex literals to parse, and thus may be attractive to Semantic Web practitioners not familiar with geospatial data formats.

¹³ http://sites.google.com/site/neogswvocs/
¹⁴ http://www.ogcnetwork.net/gml-sf

In [15], Hu and Du compose a three level hierarchical spatiotemporal model: a meta level for abstract space-time knowledge, a schema level for well-known models in spatial and temporal reasoning (e.g., RCC, Allen time [1]), and an instantiations level that provides mappings and formal descriptions of the various ground spatiotemporal statements in the Linked Data clouds. This meta approach provides a convenient way to abstract out spatial knowledge from its underlying representation. Mappings, however, have to be defined for each dataset at the instantiations level.

A complete implementation of integrating spatial data and queries into an RDF triple store is described by Brodt et al. in [7]. By typing WKT string representations of geometry literals with a spatial datatype, the triple store is able to efficiently store and query data. Once again, the OGC Simple Features relations are used as the basis for posing queries for the spatial relation. These relations are mapped to SPARQL filter functions which allow for direct comparison between spatial literals. This implementation is similar to the approach taken by Parliament and OpenSahara in that the SPARQL language itself does not need to be modified in order to query for the spatial relations between entities.

5. Introduction to GeoSPARQL

As illustrated above, many groups have created ontologies and query predicates to make indexing and query of geospatial data possible. However, since there are many of these and they all differ slightly, data that can be spatially queried in one knowledge base may not be able to be spatially queried in another. GeoSPARQL provides a standard for geospatial RDF data insertion and query, which covers the use cases of the other previous approaches.

The GeoSPARQL specification attempts to enable a wide range of geospatial query use cases, from simple points of interest knowledge bases to detailed authoritative geospatial data sources for transportation. Moreover, use of GeoSPARQL for both of these types of data should enable the data sets to be easily used together. In order to achieve this goal, different conformance classes are provided. This means that a simple knowledge base implementation intended for simple use cases need not implement all of the more advanced reasoning capabilities of GeoSPARQL, such as quantitative reasoning or query rewriting.

There are also several sets of terminology for the topological relationships between geometries. Rather than mandate that all implementations use the same set of terminology, each implementation can choose which sets of terms to support. This is discussed further in the section on GeoSPARQL relationships.

GeoSPARQL attempts to address the problems with the disparate and incompatible implementations for representing and querying spatial data. It achieves this by defining an ontology that closely follows the existing standards work from the OGC with regard to spatial indexing in relational databases.

The GeoSPARQL specification contains three main components:

1. The definition of a vocabulary to represent features, geometries, and their relationships.
2. A set of domain-specific, spatial functions for use in SPARQL queries.
3. A set of query transformation rules.

5.1. GeoSPARQL Ontology

The ontology for representing features and geometries is fundamental to being able to build and query spatial data. The ontology is based on the OGC's Simple Features model, with some adaptations for RDF. Note that prefix definitions are omitted from examples for clarity; a definition of all of the prefixes used is at the end of the paper in listing 15¹⁵. The ontology includes a class geo:SpatialObject, with two primary subclasses, geo:Feature and geo:Geometry. These classes are meant to be connected to an ontology representing a domain of interest. Features can connect to their geometries via the geo:hasGeometry property.

¹⁵ In order to be more relevant to the forthcoming finalized GeoSPARQL standard, this paper makes use of an updated OGC internal draft of GeoSPARQL. While the functionality is the same as the public draft, the uniform resource identifiers (URIs) for GeoSPARQL predicates and classes have been updated. We have chosen to use the newer URIs in this document in order to be more consistent with the standard once it is released.

For example, an airport is a geo:Feature. It is a conceptual thing that exists in the real world in a particular place. In the real world, it has a location that corresponds to the coordinates of all of the infinite points along the border of the airport area. This real-world location has to be measured and estimated in some way. It is possible to do this measurement at various resolutions, each of which may serve well for different purposes. A representation of the real world location which has been measured becomes a geo:Geometry. Thus the airport may have several geo:Geometry representations, ranging from a single point in the center of the airport to an extremely detailed polygon that closely follows the airport's outside border. A geometry that will function for most purposes within a dataset can be specified as the geo:defaultGeometry.

GeoSPARQL includes two different ways to represent geometry literals and their associated type hierarchies: WKT and GML. An implementor of a spatial triple store may choose to support either or both of these representations. GeoSPARQL provides different OWL classes for the geometry hierarchies associated with both of these geometry representations. This provides classes for many different geometry types such as point, polygon, curve, arc, and multicurve. The geo:asWKT and geo:asGML properties link the geometry entities to the geometry literal representations. Values for these properties use the sf:wktLiteral and gml:gmlLiteral data types respectively.

5.2. GeoSPARQL Relationships

GeoSPARQL also includes a standard way to ask for topological relationships, such as overlaps, between spatial entities. These come in the form of binary properties between the entities and geospatial filter functions.

The topological binary properties can be used in SPARQL query triple patterns like a normal property. Primarily they are used between objects of the geo:Geometry type. However, they can also be used between geo:Feature entities, or between geo:Feature and geo:Geometry entities, if GeoSPARQL's query rewrite rules are supported (discussed in the next section). The properties can be expressed using three distinct vocabularies: the OGC's Simple Features, Egenhofer's 9-intersection model, and RCC8. Which of these vocabularies is supported can be dependent on the triple store implementation, though it is likely that implementations will support all three. The Simple Features topological relations include equals, disjoint, intersects, touches, within, contains, overlaps, and crosses.

The filter functions provide two different types of functionality. First, there are operator functions which take multiple geometries as predicates and produce either a new geometry or another datatype as a result. An example of this is the function ogcf:intersection. This function takes two geometries and returns a geometry that is their spatial intersection. Other functions like ogcf:distance produce an xsd:double as a result. The second type of functionality is boolean topological tests of geometries. These come in the same three vocabulary sets as the topological binary properties: Simple Features topological relations, Egenhofer relations, and RCC8 relations. These functions are partially redundant with the topological binary properties; however, the topology functions take the geometry literals as parameters, while the binary properties relate geo:Geometry and geo:Feature entities. This means that quantitative and qualitative applications can both make use of the binary properties, but only quantitative applications can make use of the topology functions. Also, comparisons to concrete geometries provided in the
query can only be made via the functions. An example of the topological functions is ogcf:intersects, which returns true if two geometries intersect.

5.3. Query Transformation Rules

The query rewrite rules allow for an additional layer of abstraction in SPARQL queries. While only concrete geometry entities can be quantitatively compared, it nonetheless sometimes makes sense to discuss whether two features have a particular topological relationship. This is represented in the natural language question, "Is Reagan National Airport within Washington, DC?". Although Reagan National is referred to as a Washington DC airport, it is actually across the Potomac river in Virginia. In GeoSPARQL, feature to feature and feature to geometry topological relations are achieved by the combination of the use of the geo:defaultGeometry property and the query rewrite rules. If a geo:Feature is used as the subject or object of a topological relation, the query is automatically rewritten to compare the geo:Geometry linked as a default, thus removing the abstraction for processing. Listing 1 shows a query with a relationship between geo:Feature objects before and after rewrite.

Listing 1: Query Rewrite Example

    #Before
    ASK {
      ex:DCA a geo:Feature;
        geo:sfWithin ex:WashingtonDC .
      ex:WashingtonDC a geo:Feature .
    }

    #After
    ASK {
      ex:DCA a geo:Feature;
        geo:defaultGeometry ?g1 .
      ex:WashingtonDC a geo:Feature ;
        geo:defaultGeometry ?g2 .
      ?g1 geo:sfWithin ?g2 .
    }

The goal of this feature is to provide a more intuitive approach to geospatial querying for use cases that do not require many different geometries, while still maintaining a concrete definition of this intuitive understanding. Compliant GeoSPARQL triple stores are not required to implement this feature.

Listing 2: Example Ontology

    ex:Restaurant a owl:Class;
      rdfs:subClassOf ex:Service .
    ex:Park a owl:Class;
      rdfs:subClassOf ex:Attraction .
    ex:Museum a owl:Class;
      rdfs:subClassOf ex:Attraction .
    ex:Monument a owl:Class;
      rdfs:subClassOf ex:Attraction .
    ex:Service a owl:Class;
      rdfs:subClassOf ex:PointOfInterest .
    ex:Attraction a owl:Class;
      rdfs:subClassOf ex:PointOfInterest .
    ex:PointOfInterest a owl:Class;
      rdfs:subClassOf geo:Feature .

Listing 3: Washington Monument

    ex:WashingtonMonument a ex:Monument;
      rdfs:label "Washington Monument";
      geo:hasGeometry ex:WMPoint .
    ex:WMPoint a geo:Point;
      geo:asWKT "POINT(-77.03524 38.889468)"^^sf:wktLiteral.

6. Using GeoSPARQL

Consider an example using a points of interest ontology in listing 2. We seek to represent points of interest of various types (Monuments, Parks, Restaurants, Museums, etc.). These types of landmarks are represented in a class hierarchy with an ex:PointOfInterest class at the top. These classes of course may include many non-spatial attributes, but only a label is included here. All that is required to link this ontology with GeoSPARQL, and thus give its classes a geospatial reference, is to make ex:PointOfInterest a subclass of geo:Feature. This may of course result in the class having two parent classes, but this is compatible with RDFS and OWL reasoning.

If compliance with WKT is chosen, and latitudes and longitudes are expressed in WGS84 datum in a longitude latitude order (CRS:84), a record for the Washington Monument would look like listing 3. If a
coordinate reference system other than CRS:84 is desired, that can be included within the literal. Listing 4 expresses the same point in WGS84 with latitude longitude order.

Listing 4: Point in WGS84 datum

    " POINT(38.889468 -77.03524)"^^sf:wktLiteral

While this representation may seem verbose, and the literal string is no longer standard WKT, it has the advantage of encoding the CRS information directly into the literal. This is all of the data needed to define a geo:Geometry; without the CRS, another property would need to be added onto the geo:Geometry instance, and that property would need to be read and passed into filter functions as well. While an argument can be made that associating the CRS with a geometry via a separate predicate would produce a "cleaner" model, GeoSPARQL focused on producing a compact representation that could contain the entire description of a geometry within a single literal. For the case of WKT, this necessitates adding the CRS specification to the WKT literal string. For the GML conformance class, the CRS information is already encoded in the GML string so no changes are required.

Now we will look at a few example GeoSPARQL queries using this data. One potential query over this dataset would be, "Which monuments are contained within a park?" This query requires a topological comparison between the geometries of the monuments and the geometries of parks. We show it in listing 5 using the binary topology property geo:within. The two entities have type statements, geo:hasGeometry properties to tie them to their geometries, and then the geo:within property to tie them together.

Listing 5: Example Query 1

    SELECT ?m ?p
    WHERE {
      ?m a ex:Monument ;
         geo:hasGeometry ?mgeo .
      ?p a ex:Park ;
         geo:hasGeometry ?pgeo .
      ?mgeo geo:within ?pgeo .
    }

If the knowledge base being used supported the query rewriting rules, and the data set included default geometries, the first query could be rewritten even more simply using a feature-to-feature topological relationship. This method is demonstrated in listing 6.

Listing 6: Example Query 2

    SELECT ?m ?p
    WHERE {
      ?p a ex:Park .
      ?m a ex:Monument ;
         geo:within ?p .
    }

Spatial user interfaces often need to look for entities of a particular type that fall within an explicit bounding box. Consider the query, "What attractions are within the bounding box defined by (-77.089005, 38.913574) and (-77.029953, 38.886321)?" Because we need to specify an explicit geometry in the query, we need to compare to it using the topological filter functions as opposed to the binary properties. We have the attraction entity and its attached geometry, and the geometry is compared with the filter function geof:within. This query is shown in listing 7. Note that the bounding box is expressed as a polygon.

Listing 7: Example Query 3

    SELECT ?a
    WHERE {
      ?a a ex:Attraction;
         geo:hasGeometry ?ageo .
      FILTER(geof:within(?ageo,
        "POLYGON((
          -77.089005 38.913574,
          -77.029953 38.913574,
          -77.029953 38.886321,
          -77.089005 38.886321,
          -77.089005 38.913574
        ))"^^sf:wktLiteral))
    }

Queries looking for entities within a particular distance of either other entities or a current location are extremely useful as well. "Which parks are within 3km of the Washington Monument?" can be easily expressed in GeoSPARQL. We assume the same URI for the Washington Monument in the data example above. We need to retrieve the two geometries and use the function geof:distance to calculate the distance between them. A standard SPARQL less-than comparison is then applied. This query is shown in listing 8.

Listing 8: Example Query 4

    SELECT ?p
    WHERE {
      ?p a ex:Park ;
         geo:hasGeometry ?pgeo .
      ?pgeo geo:asWKT ?pwkt .

      ex:WashingtonMonument
         geo:hasGeometry ?wgeo .
      ?wgeo geo:asWKT ?wwkt .

      FILTER(geof:distance(?pwkt, ?wwkt, units:m) < 3000)
    }

These example queries require relatively little in terms of non-spatial constraints, but serve to illustrate some basic functionality with GeoSPARQL. With a more complex ontology, queries could include more complicated thematic elements as well.

6.1. Exploiting Geospatial Data

Storing geospatial data just as RDF triples does not allow for the spatial exploitation of that data. In order to be able to efficiently query for the relationships between spatial entities, the data must be indexed. This allows only those resources that match the spatial component of a query to be retrieved, rather than spatially filtering all bindings that match a given query. The relative performance advantages of using a spatial index versus spatially filtering a result set are discussed by Brodt et al. in [7].

6.2. Parliament and GeoSPARQL

In order to take advantage of data represented in GeoSPARQL, a SPARQL endpoint needs to understand the GeoSPARQL ontology and provide support for one (or more) of the conformance classes. The Parliament triple store provides one such implementation (based on a draft version of the specification) that allows GeoSPARQL data to be indexed and provides a query engine that supports GeoSPARQL queries.

Parliament has a modular architecture that enables indexes to be built on top of the storage engine. One of these indexes is a spatial index based on a standard R-tree implementation [14]. In a similar approach to [7], the spatial index is integrated in a way that provides native support for spatial relational queries and efficient storage of the data. The general goal for this index is to split SPARQL queries with geospatial information into multiple parts, allowing for an optimized query plan between the spatial components of the query and the components with non-spatial triples to be executed.

Before the emergence of GeoSPARQL, Parliament indexed data represented in GeoOWL and allowed RCC8 and OGC Simple Features relations to be queried [16,17]. We are currently implementing GeoSPARQL based on the public candidate draft standard, and we describe this implementation in the following section.

6.2.1. Index Specification

The index interfaces in the Parliament API include methods for building a record from data, adding and removing records, finding a record by URI, and finding records by value. Existing indexes include the aforementioned GeoOWL spatial index, a temporal index (for indexing OWL Time16), and a basic numeric index (for optimizing range queries on numeric property values).

16 http://www.w3.org/TR/owl-time/
17 http://www.openjena.org
18 http://openjena.org/javadoc/com/hp/hpl/jena/graph/Graph.html
19 http://openjena.org/javadoc/com/hp/hpl/jena/graph/GraphListener.html

The Parliament triple store is built with support for Jena’s RDF Application Programming Interface (API) and ARQ SPARQL query engine17. An implementation of Jena’s graph interface18 provides access to the base graph store with support for adding, removing, and finding triples. By attaching a listener19 to the graph, the addition and removal of triples can be detected and forwarded to any associated indexes. Parliament’s GeoSPARQL index listens for triples that contain the geo:asWKT or geo:asGML predicates. Any triple that is added to the graph is checked to see if it matches. At this point, the index can create a record for the geometry that is represented in the object of the triple and insert it into the index. For instance, when adding the triples in listing 3 to Parliament, the index
will generate a single record that contains a reference to the resource ex:WMPoint and its WKT value.

In order to query for data in an index, we have extended the ARQ query engine to support property functions that can access indexes. When ARQ parses a SPARQL query, it detects the different operators in the query. A particularly useful feature of ARQ is the support for property functions. Instead of matching a triple in a graph, property functions can execute custom code in the context of the SPARQL query. Consider the query in listing 9. It will yield a result set with a single binding for ?x by concatenating all of the arguments together to form the literal "Hello World" instead of attempting to match the triple pattern. Parliament implements the GeoSPARQL spatial relations, such as geo:intersects, as property functions.

Listing 9: Property Function Query

    SELECT ?x
    WHERE {
      ?x apf:concat("Hello", " ", "World") .
    }

6.3. Optimizing Query Execution

By using an index, a query can be optimized such that the most selective part is executed first. Consider the query in listing 10. This query is asking for all monuments within the National Mall. Parliament’s query optimizer splits the query into blocks for execution based on how the variables in the query are used and what special operations occur in the query. In this instance, since the predicate, geo:within, is an index property function, the triple containing the predicate is considered as one query block. The rest of the query is a simple basic graph pattern containing two triples describing ?m. The basic graph pattern is analyzed and partitioned so that no variable crosses partitions. For this query, this generates two partitions. When the query is executed, the Parliament query optimizer has two choices: (1) it can execute the spatial part first and then match to the graph pattern, or (2) it can execute the graph pattern first, then execute the spatial operation. The optimizer will decide what to do based on which path it estimates will provide the fewest result bindings.

Listing 10: Optimization Example Query

    SELECT ?m
    WHERE {
      ?m a ex:Monument ;
         geo:hasGeometry ?mgeo .
      ?mgeo geo:within ex:NationalMallGeometry .
    }

In this example, there are two sub patterns: the index property function, and the basic graph pattern for ?m. Each sub pattern estimates how many results it will be able to provide. For operations within the spatial index, a bounding box query can be performed to estimate how many results will be returned. In this example, the index will look up the bounding box for ex:NationalMallGeometry and calculate how many items it contains. For basic graph patterns, Parliament keeps statistics on the triples it contains and can quickly estimate how many matches are in the store for a given triple pattern. Sub patterns containing basic graph patterns use these statistics to estimate how many triples will be bound by the pattern. After each sub pattern has an estimate, they are sorted in ascending order of estimate and executed accordingly. In this case, if there were 500 monuments with geometries, but only 100 geometries within the bounding box, the index property function would be executed first. If, however, there were 500 geometries within the bounding box, but only 10 monuments exist in the triple store, the basic graph pattern would execute first and the spatial relationship would be tested for each of the 10 results.

In addition to optimizing pattern order, optimizing spatial operations can provide significant performance benefits. Consider the query in listing 11. This query is deceptive in that it appears to be asking a simple geospatial question: "What are all the geometries within 10km of the feature described by ?". However, there is no way for Parliament’s query optimizer to know a priori what the distances between geometries are. While some implementations of GeoSPARQL may be optimized to handle cases like this, in Parliament this query would not use the spatial index at all; the geometries for all parks would be found before checking their distance. Table 2 shows that there are nine features matching this query.

Listing 11: Example Query 5

    SELECT ?x
    WHERE {
      GRAPH {
        geo:hasGeometry ?geo1 .
        ?geo1 geo:asWKT ?wkt1 .
        ?x geo:hasGeometry ?geo2 .
        ?geo2 geo:asWKT ?wkt2 .
        BIND (geof:distance(?wkt1, ?wkt2, units:m) as ?distance) .
        FILTER (?distance < 10000)
      }
    }

Table 2: Example Query 5 Results

    x
    http://sws.geonames.org/4199542/
    http://sws.geonames.org/7242246/
    http://sws.geonames.org/4183291/
    http://sws.geonames.org/4201877/
    http://sws.geonames.org/4224307/
    http://sws.geonames.org/4192596/
    http://sws.geonames.org/4212826/
    http://sws.geonames.org/4192776/
    http://sws.geonames.org/4194405/

Listing 12 shows a version of this query that Parliament can optimize. This version buffers the geometry for the feature by 10000 meters and then uses that buffer to test the geo:sfContains relationship. Parliament can take this query and run the spatial component against the spatial index in order to reduce the number of results that need to match the rest of the query. Running the un-optimized query on our Parliament GeoSPARQL endpoint (described in the next section) takes 6.967 seconds. The optimized version, however, runs in 0.047 seconds. A future version of the Parliament query optimizer will be able to optimize queries like listing 11 automatically.

Listing 12: Example Query 5 - Optimized

    SELECT ?x
    WHERE {
      GRAPH {
        geo:hasGeometry ?geo1 .
        ?geo1 geo:asWKT ?wkt1 .
        BIND (geof:buffer(?wkt1, 10000, units:m) as ?buff) .
        ?x geo:hasGeometry ?geo2 .
        [ a geo:Geometry ;
          geo:asWKT ?buff ] geo:sfContains ?geo2 .
      }
    }

6.4. GeoSPARQL and Linked Data

As the Semantic Web materializes on the internet in the form of Linked Data, there is an increasing amount of structured data available with some sort of geospatial context attached. However, the vast majority of this geospatial context cannot be utilized for spatial queries because the hosting SPARQL endpoints cannot perform them. In the following section, we discuss four datasets that are representative of the disparate types of data that can be integrated together via their spatial context. We then demonstrate this integration on two datasets using Parliament.

6.4.1. Geospatial Data Sets

GeoNames20 provides information for over eight million geospatial features. The data is exposed via an RDF webservice21 that exposes information on a per resource basis. The geospatial aspect is represented using the W3C Basic Geo vocabulary. There is no ability, however, to perform any type of SPARQL query to retrieve data.

20 http://www.geonames.org
21 http://www.geonames.org/ontology/documentation.html
22 http://www.dbpedia.org
23 http://www.wikipedia.org

Another significant source of geospatial data is DBpedia22. As described in [6], data from Wikipedia23 is extracted into RDF. Many of these extracted entities are geospatial in nature (cities, counties, countries, landmarks, etc.) and many of these entities already contain some geospatial location information. The data also contains owl:sameAs links between the DBpedia and GeoNames resources. However, without any spatial computation predicates, this geospatial information can only be retrieved "as is." A query like
"Show all cities within 50 miles of Arlington, VA with a population of at least 100,000 people in which at least one famous person was born" is not possible, even though the data exists to support it.

The linked data community has released the LinkedGeoData data set [2]. This data set is a spatial knowledge base, derived from OpenStreetMap24 and is linked to DBpedia and GeoNames resources. It contains over 200 million triples describing the nodes and paths from OpenStreetMap. The data set is accessible via SPARQL endpoints running on the OpenLink Virtuoso platform. A REST interface to LinkedGeoData is also provided.

24 http://www.openstreetmap.org

Another source of geospatial data is the United States Geological Survey (USGS). The USGS has released a SPARQL endpoint25 for triple data derived from The National Map [30], a collaborative effort to deliver usable topographic information for the United States. This dataset is much more specific and specialized than the data that is provided by DBpedia and GeoNames. It includes geographic names, hydrography, boundaries, transportation, structures, and land cover. The group has attempted to follow the forthcoming GeoSPARQL specification, though some aspects of GeoSPARQL have changed slightly since the data has been published. Due to a lack of available GeoSPARQL triple stores, the published dataset includes a pre-computation of all of the topological relationships between entities. A query such as "Show all rail lines that cross rivers" is in fact possible to answer by looking at the current precomputed data. However, this means that if a new entity is added, the knowledge base needs to compute everything that the entity is related to and update those entities as well. If this data was not precomputed, the only way to answer the query would be via indexing the data and querying the relations with a relationship predicate.

25 http://usgs-ybother.srv.mst.edu:8890/sparql

6.4.2. Integration in Parliament via GeoSPARQL

GeoSPARQL provides the means to link geospatial datasets together, resulting in the possibility of new, meaningful entailments. As it is simply not possible to pre-compute all of the relations between all of the available geospatial datasets, enriching the existing datasets with GeoSPARQL representations, and creating indexes for the data is one way that data providers can share data while providing access to geospatial semantics.

For the following examples, we have processed a subset of the GeoNames RDF dataset26 and the USGS GeoSPARQL data for Atlanta, GA. This data is accessible via a Parliament SPARQL endpoint with a GeoSPARQL spatial index27. The queries in listings 11, 12, and 14 can be executed against this data. The index supports indexing data conforming to WKT and GML serialization and GeoSPARQL queries including those with Simple Features, Egenhofer, and RCC8 relations, the non-topological query functions and common topological query functions. Query rewriting is not supported at this time.

26 http://download.geonames.org/all-geonames-rdf.zip
27 http://geosparql.bbn.com

As so much data is represented as individual latitude and longitude data using the W3C Basic Geo vocabulary, including that provided by GeoNames, it is desirable to be able to convert it easily into GeoSPARQL. It is trivial to provide functions that take a latitude and longitude pair and convert them into a GeoSPARQL point. Parliament provides SPARQL property functions, spatial:toWKTPoint and spatial:toGMLPoint which take a latitude, longitude, and optional spatial reference system identifier as arguments and return a sf:wktLiteral or gml:gmlLiteral representation of a point. This makes it possible to load and index existing spatial data sets without having to regenerate existing RDF. After loading the GeoNames data into Parliament, it was aligned with GeoSPARQL using the query in listing 13. This query explicitly assigns GeoNames features to be GeoSPARQL features, creates a new geo:Point resource containing the point information, and links the feature to the geometry.

The USGS provides their RDF data in a format that is similar to GeoSPARQL. The geospatial data from The National Map contains polygon, polyline, and point data. The data conforms to an earlier revision of the GeoSPARQL standard and contains features and geometries, but lacks typed literals for the geo:asWKT and geo:asGML property values. For this data, the constructor functions for the GeoSPARQL literal datatypes can be used to create correctly typed values at query time. The existing spatial relations statements in the dataset were ignored when processing the data.
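The latitude and longitude conversion described above for spatial:toWKTPoint can be illustrated with a minimal Python sketch. It assumes the default CRS:84 longitude-latitude order used in listing 3. The function names to_wkt_point and to_wkt_literal and the range checks are illustrative assumptions and do not reproduce Parliament's implementation.

```python
def to_wkt_point(lat: float, lon: float) -> str:
    """Format a W3C Basic Geo lat/long pair as WKT point text.

    Assumption: CRS:84 axis order, so longitude is written first.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return f"POINT({lon} {lat})"


def to_wkt_literal(lat: float, lon: float) -> str:
    """Wrap the point text as a typed literal in Turtle syntax.

    The sf:wktLiteral datatype name follows the listings in this paper.
    """
    return f'"{to_wkt_point(lat, lon)}"^^sf:wktLiteral'


# Washington Monument coordinates from listing 3
print(to_wkt_literal(38.889468, -77.03524))
```

Note that the arguments arrive in latitude, longitude order, as in the wgs84_pos properties, while the WKT text is emitted longitude first.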
Listing 13: GeoNames Conversion Query

    CONSTRUCT {
      ?feature a geo:Feature ;
        geo:hasGeometry [
          a geo:Point ;
          geo:asWKT ?wkt
        ] .
    }
    WHERE {
      ?feature a gn:Feature ;
        wgs84_pos:lat ?lat ;
        wgs84_pos:long ?long .
      BIND (spatial:toWKTPoint(?lat, ?long) as ?wkt) .
    }

6.4.3. GeoSPARQL Data Query

Once GeoSPARQL data exists for both GeoNames and the USGS and is loaded into a GeoSPARQL capable knowledge base, such as Parliament, interesting geospatial questions can be posed. Consider the following query: "What are all the schools near Atlanta, GA that are within 100 meters of a railway?" GeoNames provides point data for buildings, including schools, while the USGS data contains polyline data for rail lines as well as polygons for different regions. Listing 14 shows the GeoSPARQL formulation for this question. This query takes advantage of several features of the language. First, the geo:sfWithin predicate is used to determine what schools exist within a boundary for Atlanta (as defined by a polygon in the query). Each school geometry is then buffered by 100 meters using the geof:buffer function. The rail features are retrieved and checked to see if they fall within this buffer. Finally, for all rail line segments that intersect the buffer, the actual distance to the school is calculated using the geof:distance function. Table 3 displays the school resources and the distance to the nearest rail line segment. This query, however, could be made more effective if there were an equivalent to the ST_DWithin function from the OGC Simple Features Specification [25].

Listing 14: Atlanta Schools Near Rail Lines Query

    SELECT DISTINCT ?school ?distance
    WHERE {
      GRAPH {
        # approximate rectangle of Atlanta
        BIND ("POLYGON((-84.445 33.7991, -84.445 33.7069, -84.331 33.7069, -84.331 33.7991, -84.445 33.7991))"^^sf:wktLiteral AS ?place) .

        ?school a gn:Feature ;
          geo:hasGeometry ?school_geo ;
          gn:featureCode gn:S.SCH .
        ?school_geo geo:sfWithin [ a geo:Geometry; geo:asWKT ?place ] ;
          geo:asWKT ?school_wkt .

        # buffer schools 100m
        BIND (geof:buffer(?school_wkt, 100, units:m) AS ?s_buff) .

        # find rail links within buffer
        ?rail a trans:railFeature ;
          geo:hasGeometry ?rail_geo .

        # only get railroads within Atlanta
        ?rail_geo geo:sfWithin [ a geo:Geometry; geo:asWKT ?place ] ;
          geo:asWKT ?rail_wkt_s .

        # convert string to WKT literal
        BIND (sf:wktLiteral(?rail_wkt_s) AS ?rail_wkt) .
        FILTER (geof:sfIntersects(?rail_wkt, ?s_buff)) .
        BIND (geof:distance(?school_wkt, ?rail_wkt, units:m) AS ?distance) .
      }
    }
    ORDER BY ASC(?distance)

Table 3: Atlanta Schools Near Rail Lines Results

    school                              distance
    http://sws.geonames.org/4183400/    44.09248276546987
    http://sws.geonames.org/7242795/    80.33873752510539
    http://sws.geonames.org/7242287/    91.9176767544314

While the sample query assumes everything is contained in a single graph, it is not unusual for the data to exist in several different graphs or endpoints. In fact, it would be ideal to not have to replicate data and to be able to query it remotely through the dataset provider. Until the means for query federation, such as the mechanism discussed in [27], are widely supported, querying across remote linked datasets will be difficult. GeoSPARQL queries should be compatible with query fed- [...] plications. In lieu of query federation, querying across [...] amples is actually contained in two different graphs. A [...] ever, enables a shorter and simpler query.

[...] of previous work on combining RDF and OWL with geospatial data. Its creation means that geospatial data [...] with an expectation of efficient geospatial queries. [...] nally utilize the significant amount of geospatial con- [...] Many triple stores, though they do not yet support [...] thereof. This is an indicator of how important geospa- [...] standard is released, the relevant vendors will move to [...] tently exchange and process geospatial data. [...] truly realize the geospatial Semantic Web, technologies such as GeoSPARQL and implementations like [...] it useful both for working on geospatial Semantic Web applications and understanding the specification.

8. Acknowledgments

We would like to thank the people at the Cen- [...] SPARQL even before the specification has been finalized. This has allowed us to identify and resolve additional issues with the specification earlier.

We would also like to thank the OGC and the other members of the GeoSPARQL working group for their continued efforts to bring this marriage of the geospatial community and the Semantic Web to life.

Listing 15: RDF Prefixes

    apf:
    ex:
    gn:
    gu:
    geo:
    geof:
    sf:
    gml:
    os:
    ose:
    osg:
    owl:
    rdfs:
    spatial:
    time:
    trans:
    units:
    wgs84_pos:
    xsd:
    virtrdf: