Publication Date: 2019-12-17
Approval Date: 2019-11-22
Submission Date: 2019-10-31
Reference number of this document: OGC 19-021
Reference URL for this document: http://www.opengis.net/doc/PER/t15-D001
Category: OGC Public Engineering Report
Editor: Esther Kok, Stephane Fellah
Title: OGC Testbed-15: Semantic Web Link Builder and Triple Generator
COPYRIGHT
Copyright © 2019 Open Geospatial Consortium. To obtain additional rights of use, visit http://www.opengeospatial.org/
WARNING
This document is not an OGC Standard. This document is an OGC Public Engineering Report created as a deliverable in an OGC Interoperability Initiative and is not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, any OGC Public Engineering Report should not be referenced as required or mandatory technology in procurements. However, the discussions in this document could very well lead to the definition of an OGC Standard.
LICENSE AGREEMENT
Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.
If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.
THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD. THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.
This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.
Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications.
This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.
None of the Intellectual Property or underlying information or technology may be downloaded or otherwise exported or reexported in violation of U.S. export laws and regulations. In addition, you are responsible for complying with any local laws in your jurisdiction which may impact your right to import, export or use the Intellectual Property, and you represent that you have complied with any regulations or registration procedures required by applicable law to make this license enforceable.
- 1. Subject
- 2. Executive Summary
- 3. References
- 4. Terms and definitions
- 5. Overview
- 6. Introduction
- 7. Data Integration
- 8. Data Fusion
- 9. Shapes Constraint Language (SHACL)
- 10. Data Sources
- 11. Implementation
- 12. Future Work
- Appendix A: Appendix A
- Appendix B: Revision History
- Appendix C: Bibliography
1. Subject
This OGC Testbed 15 Engineering Report (ER) describes a generalized approach towards performing data fusion from multiple heterogeneous geospatial linked data sources. The specific use case is semantic enrichment of hydrographic features provided by Natural Resources Canada (NRCan). The ER attempts to define and formalize the integration pipeline necessary to perform a fusion process for producing semantically coherent fused entities.
2. Executive Summary
The web and enterprise intranets have facilitated access to a vast amount of information. When data from multiple sources can be combined, its usefulness increases dramatically. Users want to query information from different sources, combine it, and present it in a uniform, complete, concise and coherent view using an information integration system. However, today there is no well-defined multi-modal data integration framework available. Such a framework can provide the user a complete yet concise and coherent overview of all existing data without the need to access each of the data sources separately: complete because no object is missing from the result, concise because no object is represented twice, and coherent because the data presented to the user contains no logical contradiction. Ensuring coherence is difficult because information about entities is stored in multiple sources and because of semantic heterogeneity.
In this Testbed, a number of ontologies supporting correlation and semantic mediation were defined using the World Wide Web Consortium (W3C) Shapes Constraint Language (SHACL), and a correlation engine was implemented and made accessible through an Application Programming Interface (API) based on Representational State Transfer (REST). Future work will need to implement a semantic mediation and fusion engine.
This engineering report makes the following recommendations for future work:
- The implementation of a semantic mediation engine supported by a mediation ontology.
- The formalization of different types of conflict and resolution strategies, enabled by ontologies.
- The demonstration of a complete fusion pipeline that includes semantic mapping, correlation, mediation and integration of entities from multiple sources defined in different ontologies.
- Extension of the REST-based Fusion Service API to support Create, Read, Update and Delete (CRUD) functions.
- Integration of Semantic Data Cubes with Conversational Agents.
2.1. Document contributor contact points
All questions regarding this document should be directed to the editor or the contributors:
Contacts
| Name | Organization | Role |
|---|---|---|
| Esther Kok | Solenix | Editor/Contributor |
| Stephane Fellah | ImageMatters | Editor/Contributor |
| Nicola Policella | Solenix | Contributor |
2.2. Foreword
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.
Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.
3. References
The following normative documents are referenced in this document.
- OGC: OGC 09-026r2, OGC Filter Encoding 2.0 Encoding Standard - With Corrigendum, 2014
- OGC: OGC 11-052r4, OGC GeoSPARQL - A Geographic Query Language for RDF Data, 2011
- OGC: OGC 14-106, Unified Geo-data Reference Model for Law Enforcement and Public Safety, 2014
- W3C: SKOS Simple Knowledge Organization System Reference, W3C Recommendation 18 August 2009
- W3C: Data Catalog Vocabulary (DCAT), W3C Recommendation 16 January 2014
- W3C: SPARQL Protocol and RDF Query Language (SPARQL), last visited 12-09-2016
- W3C: JSON-LD 1.1: A JSON-based Serialization for Linked Data - Candidate Recommendation, 2019
- W3C: JSON-LD 1.0: A JSON-based Serialization for Linked Data - W3C Recommendation, 2014
- ISO: ISO 19101-2:2018 Geographic information — Reference model — Part 2: Imagery
4. Terms and definitions
For the purposes of this report, the definitions specified in Clause 4 of the OWS Common Implementation Standard OGC 06-121r9 shall apply. In addition, the following terms and definitions apply.
- Unstructured Data: Traditional unstructured data consists of text documents and other file types such as videos, audio, and images. Large amounts of unstructured data come from sources external to the enterprise, such as social media. Natural language text is usually considered unstructured data.
- Semi-structured Data: Semi-structured data is technically a subset of unstructured data and refers to tagged or taggable data that does not strictly follow a tabular or database record format. Examples include languages like XML, JSON and HTML.
- Structured Data: Structured data is "data that resides in fixed fields within a record or file". Examples include tables, spreadsheets, or databases (relational or NoSQL). Structured data is mostly understood today as data that conforms to a well-known schema (RDBMS schema, XML schema, JSON Schema). The schema defines the structure and syntactic constraints on the data.
- Triple: A Triple is the core abstraction of the semantic web. It describes a statement as a triple of "Subject - Predicate - Object". URIs are used to identify the subject of the statement. The object of the statement can be another URI or a literal such as a string or number. The triple model is a minimalist model that can capture any form of data, including tables, trees, and graphs.
- Semantic Mediation: Semantic mediation is defined as the transformation from one conceptual model to another, in particular from one ontology to another. Instances of the target classes are created from the values of instances of the source classes.
4.1. Abbreviated terms
- API Application Programming Interface
- DCAT Data Catalog Vocabulary
- DCAT-AP DCAT Application Profile for Data Portals in Europe
- EARL Evaluation and Report Language
- EU European Union
- ER Engineering Report
- GeoDCAT-AP Geographical extension of DCAT-AP
- ISO International Organization for Standardization
- N3 Notation 3 format
- OGC Open Geospatial Consortium
- OWL Web Ontology Language
- RDF Resource Description Framework
- RDFS RDF Schema
- SHACL Shapes Constraint Language
- SKOS Simple Knowledge Organization System
- SPARQL SPARQL Protocol and RDF Query Language
- TTL Turtle Format
- URI Uniform Resource Identifier
- URL Uniform Resource Locator
- W3C World Wide Web Consortium
- XML eXtensible Markup Language
5. Overview
Section 6: Introduction defines the problem of enriching geospatial data with semantic data using hydro features as a specific use case while also addressing the status quo of semantic enrichment.
Section 7: Data Integration discusses the topic of data integration, outlines the challenges and defines a formal approach for performing data fusion from multiple heterogeneous data sources.
Section 8: Data Fusion defines data fusion in the context of data integration, classifies the types of conflict that arise when merging records, and summarizes conflict handling strategies and conflict resolution functions.
Section 9: Shapes Constraint Language (SHACL) introduces the W3C Shapes Constraint Language (SHACL) standard, a language for validating RDF graphs against a set of conditions, describing the shapes of graph data, and providing validation, transformation and inference rules. This language is used in the correlation and semantic mediation ontologies and simplifies the mediation ontologies designed in previous OGC Testbeds 10, 11 and 12.
Section 10: Data Sources describes the main data sources used for the implementation of the integration pipeline, which performs data fusion from multiple heterogeneous geospatial linked data sources.
Section 11: Implementation presents the implementation of the integration pipeline and the main work done in this Testbed thread. The section shows the architectural overview, the approach for the correlation phase and semantic mediation, and the REST API designed to support the correlation phase.
Section 12: Future Work addresses future work in this field.
Appendix A documents the ontologies developed during this Testbed.
6. Introduction
This section defines the problem of enriching geospatial data with semantic data using hydrographic features as a specific use case. This section also addresses the status quo of semantic enrichment.
6.1. Problem Definition
Efforts in OGC Testbeds-11 [1] and 12 [2, 3] targeted semantic mediation support. The focus was exploring the high-level description of ontologies and the metadata needed to enable search on controlled vocabularies. As part of the Testbed-12 activity, a REST API was developed to access vocabulary metadata. This Semantic Registry Service was used as an aid in a first pass of the data in the further enrichment of an existing knowledge base.
The work described in this ER utilizes the vocabulary described in the Semantic Registry Information Model (SRIM) as an enabler for further knowledge base population. The SRIM is defined as a superset of the W3C DCAT standard and encoded as an OWL (Web Ontology Language) ontology. However, there should be restrictions in place (such as mandatory, recommended or optional fields) that cannot be captured with OWL and would otherwise require human interpretation. The Shapes Constraint Language (SHACL) [4, 5] is a W3C standard that provides a framework to define the shape of graph data. Proposed future work included investigation of SHACL shapes for their applicability to application profiles, form generation and data entry, data validation, and quality control of linked data information.
The work described in this ER focused on enriching geospatial structured data sources. The tools created in this Testbed harvest semantic data sources and attempt to correlate and fuse this information. The hydrological features for the Richelieu River/Watershed were the data source for this use case. One of the challenges in the work was mapping the ontology vocabulary in the Semantic Registry Information Model to other ontologies used in geospatial databases. The alignment and "mapability" need to be addressed to confirm the relation and/or add more information to the relation.
The high-level goal of the work described in this ER was to perform knowledge base population on the hydrological entities in the Richelieu River/Watershed. The ultimate goal was to see if the solution for this use case could be generalized for multiple problems.
6.2. Status Quo
This section describes the current challenges related to data integration and fusion.
6.2.1. Lack of a Multi-Modal Data Integration Framework
Inexpensive networks such as the web and enterprise intranets have facilitated the access to a vast amount of information. An information integration system can be utilized to combine and present data from multiple sources to increase the usefulness of the information. Unfortunately, today there is no well-defined multi-modal data integration framework available that can present a complete, concise, and coherent representation of the data to the user.
6.2.2. Lack of a Unified, Extensible Logical Graph Model
Current end-to-end data processing environments are plagued by inconsistent data models and formats that were developed in closed technology platform/program settings, are often proprietary, are not easy to extend, and are not suited for satisfying “Unified Knowledge Graph” needs. Such an eclectic approach leads to logical inconsistencies that must be addressed by human-intensive tools and processes, which impedes operational readiness and time-to-action. Logical inconsistencies can also render data useless. Moreover, the automation of graph workflows is severely inhibited by the lack of a well-known, standards-based logical graph model that works across many systems and organizations.
6.2.3. Lack of Formal Semantics
Current data processing environments do not formally model and explicitly represent semantics, nor do they address the semantic validity and consistency of their data and data models. This leads to ambiguity in most data, which increases the cognitive workload on users, severely inhibits machine-to-machine interoperability and understanding, and greatly reduces the potential for task/workflow automation (machine reasoning) and multi-source data integration.
6.2.4. Lack of Consistent Means for Recording and Exploiting Provenance and Pedigree Information (Source and Method)
Few systems formally capture and exploit the provenance and pedigree (P&P) of sources and methods. Moreover, P&P is rarely captured for all significant information elements, nor is it captured consistently between systems. This leads to downstream processing impediments where users are unable to conflate or assess their confidence in data. In turn, this uncertainty inhibits the automation of data processing workflows and increases the operating burden on users.
What is needed is a formal recording and tracking of P&P information about the sources and methods of all major information elements for all repository holdings. This is essential to database integrity management and answering questions users may have about the source and method for any information that is subsequently extracted, disseminated, and exploited. Formal recording and tracking of P&P information must be built into the database, and handled as a key function of database services.
7. Data Integration
To improve the usefulness of data from different sources, an information integration system can be defined to present the data to the user. This section discusses the challenges of data integration and describes an approach for performing data fusion from multiple heterogeneous data sources.
7.1. Classifying Data: Structured, Unstructured, Semi-structured
Before delving into how to implement a robust integration system, it is important to understand the different types of data that are available on the network. Data can be placed into three categories: unstructured, semi-structured, and structured.
- Unstructured Data are data that are not easily organized or formatted. Traditional unstructured data consist of text documents and other file types such as videos, audio, and images. Large volumes of unstructured data come from sources external to the enterprise, such as social media, and constitute the vast majority of available data on the web (about 80 percent of all available data [6]). The volume increases every year. Collecting, processing, and analyzing unstructured data presents a significant challenge. With more information becoming available via the web, most of it unstructured, finding ways to use the data has become a vital strategy for many businesses. Recent advances in Deep Learning have addressed many challenges that seemed impossible to solve just 10 years ago. Deep Neural Networks are now capable of identifying objects in images and video scenes, robustly identifying entities and relationships, and understanding context and meaning in texts (e.g., BERT [7], OpenAI GPT-2 [8], RoBERTa [9]). While these types of data are out of scope for this Testbed, investigating the application of these latest advances in future Testbeds would help tap into the vast knowledge buried in this information.
- Semi-structured Data is technically a subset of unstructured data and refers to tagged or taggable data that does not strictly follow a tabular or database record format. Examples include languages like XML, JSON and HTML.
- Structured Data is "data that resides in fixed fields within a record or file" (Webopedia). Examples include tables, spreadsheets, or databases (relational or NoSQL). Structured data is mostly understood today as data that conforms to a well-known schema (RDBMS schema, XML schema, JSON Schema). The schema defines the structure and syntactic constraints on the data.
While this categorization of data is well understood and widely adopted in the industry, it misses an important aspect: whether the data can be understood by machine-based algorithms or not. Understanding data means that computers can unambiguously interpret the meaning of information and be capable of inferring new information from data. This capability is the realm of Linked Data and the Semantic Web. Semantics are required for data and service interoperability. Semantics are also imperative for machine-to-machine understanding, reasoning (inference), and automation. Semantics greatly aid in search and navigation. They constitute the unifying means for interrelating heterogeneous data. They also aid in data abstraction, categorization, organization, and validation. Finally, semantics give data context and unambiguous meaning.
Many people use the term “data format” to suggest that they have a common, interoperable data model. In the world of formal data modeling, this is considered to be flawed thinking that is fraught with system interoperability challenges. This thinking tends to surface in rushed stovepipe development efforts, and in settings where data modeling experts are absent. (System engineers and software developers who lack data modeling expertise notoriously skirt formal data models, and often misinterpret or vary from data standards.) For example, many developers make the simple mistake of treating a file format as their data model for interoperability. First of all, a (file) format is simply a convenient encoding for point-to-point data exchange purposes. A format is not a logical data model, per se, although it may have a formal logical data model behind it. Whereas a well-designed file format like Shapefile or KML may resolve schema and syntax issues, it does not explicitly deal with the semantics and context of the content it exchanges. Likewise, a collection of (file) formats is not a data model. File formats may actually hinder interoperability because each format requires custom mapping and translation software to another format, and the transformed results often do not align with sound, industry-approved coherent data models that were designed with interoperability in mind (as well exemplified by the work at OGC). Whereas file formats may be effective convenience mechanisms for point-to-point data exchanges, they are not good interoperability mechanisms for open service-based platforms and environments with many-to-many data exchange nodes. In summary, they tend to keep us from achieving the desired higher level of uniformity, logical consistency, and semantic harmony we seek for a System-of-Systems.
7.2. Linked Data and Semantic Web
In a famous Scientific American article [10], Tim Berners-Lee described the aim of the Semantic Web: bringing the web to its full potential. To enable wider adoption of the Semantic Web, the term Linked Data was introduced in 2008 [11], providing a simplified view of the semantic web as a web of linkage between data nodes. The idea of linked data is similar to the web of hypertext, but the semantic web is not merely about publishing data on the web; it is about making links in such a way that a person or a machine can explore the data. Linked data leads to other related data. The semantic web is also constructed in such a way that it can be parsed and reasoned about. The web of hypertext is constructed with links anchored in HTML documents, whereas the semantic web is constructed in such a way that arbitrary links between entities are described by Triples in RDF. URIs are used to identify any kind of concept. [12]
The following describes some of the important concepts of the semantic web:
- URI: A Uniform Resource Identifier is a sequence of characters that unambiguously identifies an abstract or physical resource. A URI can be used as a reference inside an RDF graph. Any resource describing a real-world object or an abstract concept is identified by a URI, which is unambiguous and can be defined in a decentralized way (using domain ownership).
- Triple: A Triple is the core abstraction of the semantic web. It describes a statement as a triple of "Subject - Predicate - Object". URIs are used to identify the subject of the statement. The object of the statement can be another URI or a literal such as a string or number. The triple model is a minimalist model that can capture any form of data, including a table, tree, or graph.
- RDF: The Resource Description Framework (RDF) is a data model for representing any kind of information on the web (using the triple model). RDF is intended as a base notation for a range of extended notations such as OWL and RDF Schema.
- RDFS: RDF Schema defines classes that represent concepts for the triples. RDFS captures information about the types of relationships between facts and adds meaning to the facts. This allows the definition of sub-types of more general types.
- OWL: The Web Ontology Language (OWL) is a Semantic Web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL adds semantics to RDF Schema, also expressed in Triples. OWL allows for the deeper specification of the properties of classes, creating possibilities not only to join data from different sources as linked data, but also to define inferencing rules (for example transitive, symmetric, and inverse relationships).
The rise of the semantic web and linked open data has helped create a large number of available open data sources that can be explored. These sources may contain controlled vocabularies that structure and constrain the interpretation of the available data. Vocabularies can, for example, be defined as ontologies using RDF Schema and OWL, published as linked data queryable with SPARQL, or constrained using SHACL. Ontologies define the meaning of the linked data, which can be modelled as a directed labelled graph. The nodes represent resources and the edges represent properties, assigned to the meaning in the ontology. Access to vocabularies is often provided at a SPARQL endpoint. SPARQL and its geospatial extension GeoSPARQL are now well-established standards for querying Linked Data representations that contain geospatial information.
- SPARQL: SPARQL Protocol and RDF Query Language, the generic RDF query language.
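To make these concepts concrete, the following Turtle snippet sketches a small RDF graph for a hydrographic feature. All example.org/example.com URIs and the ex:flowsInto property are hypothetical placeholders used only for illustration; they are not part of the Testbed data.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/hydro/> .

# A class declared with RDF Schema.
ex:River a rdfs:Class ;
    rdfs:label "River"@en .

# Triples about one resource: each line is a Subject - Predicate - Object statement.
ex:RichelieuRiver a ex:River ;                      # object is another URI (a class)
    rdfs:label "Richelieu River"@en ;               # object is a literal
    ex:flowsInto ex:StLawrenceRiver .               # link to another resource

# OWL semantics: stating that two different URIs denote the same real-world entity.
ex:RichelieuRiver owl:sameAs <http://example.com/other-dataset/richelieu-river> .
```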
Unfortunately, in practice many resources and properties do not have resolvable URIs, so the exploration of other linked data is hindered. The principles of 5-star linked open data, which are described in the next subsection, aim to resolve some of these hindrances. Figure 1 illustrates the 5-star principles.
7.3. 5-Star Linked Open Data
As a means to encourage individuals and governments to implement and improve on linked data [12], Tim Berners-Lee developed a 5-star system in 2010 for grading linked open data. Linked data is not necessarily linked open data, where the data is published under an open license. Linked data can also be 5-star (see Figure 1) within an internal system, but linked open data follows the graded scale below:
1. On the web: available on the web (whatever format), but with an open license in order to be Open Data
2. Machine-readable data: available as machine-readable structured data
3. Non-proprietary format: as (2), plus a non-proprietary format
4. RDF standards: all the above, plus use of open standards from W3C (RDF and SPARQL) to be identifiable, so that others can link to the data
5. Linked RDF: all the above, plus linking the data to other data to provide context
7.4. Semantic Mediation
Semantic mediation is defined as the transformation from one conceptual model to another, in particular from one ontology to another. Instances of the target classes are created from the values of instances of the source classes. Related work on this topic was documented in OGC Testbed-11 Symbology Mediation ER (OGC 15-058 [1]).
Semantic Mediation was addressed to some extent in OWS-8 [13], OWS-9 [14] and OWS-10 [15]. Those Testbeds were mostly focused on performing semantic mediation for taxonomies. For example, gazetteers such as the Geographic Names Information System (GNIS), GEOnet Names Server, and Geonames often use different taxonomies for classifying feature types. To support semantic mediation, mappings are required from one concept in a source taxonomy to another one in the target taxonomy (using SKOS mapping relationships such as skos:exactMatch, skos:broadMatch, and skos:closeMatch). The semantic mediation was demonstrated using the OGC Web Feature Service-Gazetteer (WFS-G); however, the mediation was performed as a black box on the syntactic representation of the features (using Geography Markup Language (GML)). In OWS-10, the hydrology sub-thread of CCI [15] attempted to address a more general approach for mediation by defining a mapping between two different hydrologic models. However, the solution was based on UML tools that perform the mapping, and no formal model was defined to encode the semantic mapping. The mapping between different hydrological models was formalized through mediation by means of shared semantics. A two-step mapping approach was applied using the HY_Features model as the mediating model. In the first step, the hydrological models are mapped to common feature concepts of HY_Features. In the second step, the HY_Features model is “re-mapped” to the target models. To achieve semantic interoperability between the National Hydrography Dataset Plus (NHD+) and National Hydrographic Network (NHN) data models, two separate mappings are required: (1) mapping of NHD+ features to the equivalent concepts of HY_Features, and (2) mapping NHN features to HY_Features equivalents.
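As a minimal illustration of such taxonomy-level mediation, the following Turtle snippet expresses SKOS mappings between feature-type concepts of two gazetteers. The gnis: and geonames: namespaces and concept URIs are hypothetical placeholders, not the actual vocabularies published by GNIS or Geonames.

```turtle
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix gnis:     <http://example.org/gnis/featureClass/> .
@prefix geonames: <http://example.org/geonames/featureCode/> .

# The two concepts can be used interchangeably.
gnis:Stream skos:exactMatch geonames:Stream .

# The target concept is broader than the source concept.
gnis:Canal skos:broadMatch geonames:HydrographicFeature .

# The concepts are similar but not strictly interchangeable.
gnis:Arroyo skos:closeMatch geonames:IntermittentStream .
```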
To address the semantic mediation of symbology, Testbed-11 represented the information as linked data (see OGC 15-058). The goal of Testbed-11 was to formally address semantic mediation for taxonomies by defining extensions to GeoSPARQL (geosparql:skosMatch) to perform the semantic mapping and by providing an extensible, sharable encoding of the semantic mappings that can be processed by machine. A RESTful Semantic Mediation Service was demonstrated that performs semantic mapping between the Homeland Security Working Group (HSWG) Incident Model and the EMS Incident Model. For this purpose, existing linked data standards (RDF, OWL, SPARQL) were leveraged to represent semantic mappings. These semantic mappings can be managed by a semantic mediation service to perform transformation between two models. OGC 15-058 [1], the Symbology Mediation Engineering Report from Testbed-11, describes the basic principles of semantic mediation using the example of two portrayal ontologies that need to be aligned in an ad hoc manner. In Testbed-12 (OGC 16-059 [2]), the semantic mediation work was closely related to the semantic portrayal work described above and built on the achievements from Testbed-11. Testbed-12 focused on the usage of a schema registry to store information about schemas and schema mappings to support ad hoc transformations between source and target schemas. Schema mappings were considered a simple form of semantic mediation, but were defined without explicit formalization of the underlying semantic knowledge required to map from one schema to another. For that reason, the idea was to design a Semantic Mediation Service REST API and integrate the API with the Semantic Registry and the CSW ebRIM profile for Schema Registry.
Since the conclusion of Testbed-13, W3C has standardized the SHACL specification [5]. The SHACL specification shows a lot of overlap with the work performed during Testbed-11 (see OGC 15-058). In particular, SHACL defines Shapes, Parameters, Functions and Rules that had been defined in the semantic mapping ontology from Testbed-11. A comparison of SHACL with OWL was performed during Testbed-14 (OGC 18-094r1) and a model was defined to represent application profiles for linked data. Within this context, an application profile defines a subset of concepts and properties from one or more ontologies to support mediation, and in many use cases semantic mediation is needed between two application profiles. W3C is currently conducting an effort to define application profiles for Linked Data [16], and many application profiles such as DCAT-AP and GeoDCAT-AP are encoded using the SHACL standard. One of the main efforts of Testbed-15 was to update the mapping ontology by using the SHACL standard. By doing so, the semantic mediation can be considerably simplified and made usable with emerging SHACL APIs. This would favor interoperability and sharing of semantic alignments.
7.5. Deduplication
When integrating multiple linked data sources, equivalent entities may be identified with different identifiers. The goal of the deduplication phase is to identify entities that are semantically similar so that they can be merged into one single entity in the next phase: semantic fusion. To perform this task, the rules of linkage between two similar entities need to be defined.
The literature suggests a number of Linked Data frameworks addressing this task (also called Link Discovery). The Silk Link Discovery Framework [17] uses the declarative Silk Link Specification Language (Silk-LSL), defined in XML, and a proprietary path syntax to define the linkage rules. Data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or related entities, which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions.
Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local and remote data sources. Silk uses a multi-dimensional blocking technique (MultiBlock [18]) to optimize the linking runtime through a rough index pre-matching. To parallelize the linking process, Silk relies on MapReduce. The framework allows user-specified link types between resources as well as owl:sameAs links. Silk incorporates element-level matchers on selected properties using string, numeric, temporal and geospatial similarity measures. Silk also supports multiple matchers, as it allows the comparison of different properties between resources, combined together using match rules. Silk also implements supervised and active learning methods for identifying Link Specifications (LS) for linking. One of the shortcomings of the framework is that it uses a proprietary document-based format for configuring link set specifications and a custom path syntax. The model does not provide a mechanism for defining additional constraints on the nodes referred to by the paths. The Silk framework comes with a workbench that allows users to define the link set specification in a visual way (see Figure 2).
Another, more recent framework is the LInk discovery framework for MEtric Spaces (LIMES). This framework implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces [19]. It is easily configurable via a configuration file as well as through a graphical user interface. According to its web site [20]:
"LIMES implements novel time-efficient approaches for link discovery in metric spaces. Our approaches facilitate different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large amount of those instance pairs that do not suffice the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. The approaches implemented in LIMES include the original LIMES algorithm for edit distances, HR3 [21], HYpersphere aPPrOximation algorithm (HYPPO) [22], and ORCHID [23]. Additionally, LIMES supports the first planning technique for link discovery HELIOS [24], that minimizes the overall execution of a link specification, without any loss of completeness. Moreover, LIMES implements supervised and unsupervised machine-learning algorithms for finding accurate link specifications. The algorithms implemented here include the supervised, active and unsupervised versions of EAGLE [25] and WOMBAT [26].
The LIMES framework consists of eight main modules of which each can be extended to accommodate new or improved functionality. The central modules of LIMES is the controller module, which coordinates the matching process. The matching process is carried out as follows: First, the controller calls the configuration module, which reads the configuration file and extracts all the information necessary to carry out the comparison of instances, including the URL of the SPARQL-endpoints of the knowledge bases S (source) and T(target), the restrictions on the instances to map (e.g., their type), the expression of the metric to be used and the threshold to be used.
Given that the configuration file is valid w.r.t. the LIMES Specification Language (LSL), the query module is called. This module uses the configuration for the target and source knowledge bases to retrieve instances and properties from the SPARQL-endpoints of the source and target knowledge bases that adhere to the restrictions specified in the configuration file. The query module writes its output into a file by invoking the cache module. Once all instances have been stored in the cache, the controller chooses between performing Link Discovery or Machine Learning. For Link Discovery, LIMES will re-write, plan and execute the Link Specification (LS) included in the configuration file, by calling the rewriter, planner and engine modules resp. The main goal of Linked Discovery is to identify the set of links (mapping) that satisfy the conditions opposed by the input LS. For Machine Learning, LIMES calls the machine learning algorithm included in the configuration file, to identify an appropriate LS to link S and T. Then it proceeds in executing the LS. For both tasks, the mapping will be stored in the output file chosen by the user in the configuration file. The results are finally stored into an RDF or an XML file."
Like the Silk framework, the configuration of LIMES is defined with a proprietary XML format and syntax for expressing rules, as illustrated in Figure 3.
The Testbed-15 approach to modeling correlation is based on open Linked Data standards (RDF, SHACL, OWL, SPARQL). The rationale is to favor reusability of the correlation rules between two linked data sets by leveraging the linking capability that is built into the linked data framework through URI identifiers. Three ontologies (a Metrics Ontology, a Similarity Ontology and a Correlation Ontology) were formalized to support the correlation task using a correlation engine implemented for this Testbed.
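The snippet below is a minimal, hypothetical sketch in Turtle of the kind of correlation output such an engine could produce. The corr:, nhn: and nhd: namespaces and all property names are illustrative placeholders only and do not reproduce the actual Testbed-15 Correlation Ontology documented in Appendix A.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix corr: <http://example.org/ont/correlation#> .   # hypothetical correlation vocabulary
@prefix nhn:  <http://example.org/nhn/feature/> .        # hypothetical source identifiers
@prefix nhd:  <http://example.org/nhdplus/feature/> .

# A correlation between two representations of the same river, recording the
# link predicate to assert and the similarity score that justified the link.
[] a corr:Correlation ;
    corr:source        nhn:watercourse-123 ;
    corr:target        nhd:flowline-456 ;
    corr:linkPredicate owl:sameAs ;
    corr:score         "0.93"^^xsd:decimal .

# The resulting link that the subsequent fusion step can rely on.
nhn:watercourse-123 owl:sameAs nhd:flowline-456 .
```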
8. Data Fusion
In the context of data integration, Data Fusion is defined as the “process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation” [27].
8.1. Conflict Classification
Due to the decentralized nature of the web, different communities can represent the same real-world objects in different ways. This results in conflicts. We can distinguish three types of conflict [27]:
- Schematic conflicts: for example, different attribute names or differently structured data sources.
- Identity conflicts: the way of identifying a real-world object may differ between the data sources.
- Data conflicts: for the same real-world object (e.g., a building), semantically equivalent attributes from one or more sources do not agree on the attribute value (e.g., source 1 reporting “23” as the building’s age, source 2 reporting “25”).
The first two kinds of conflict are resolved during the semantic mediation and correlation phases of data integration. The third kind, data conflicts, remains unresolved until data fusion and is caused by the remaining multiple representations of the same real-world objects.
We distinguish two kinds of data conflict: (a) uncertainty about the attribute value, caused by missing information; and (b) contradictions, caused by different attribute values [27]:
- Uncertainties: An uncertainty is a conflict between a non-null value and one or more null values that are all used to describe the same property of an object. In the Testbed-15 scenario, this is caused by missing information, such as null values in the sources, or an attribute completely missing in one source. The reason for considering uncertainties a special case of conflict is that they are generally easier to cope with than contradictions. We deliberately choose to assume that most null values in a data integration scenario are unknown values. But even when considering null values as being inapplicable or withheld, which together with unknown are the three most common semantics of null values [28], the assessment of the different techniques and systems remains valid.
- Contradictions: A contradiction is a conflict between two or more different non-null values that are all used to describe the same property of the same object. In the Testbed-15 data integration scenario, this is the case if two or more data sources provide two or more different values for the same attribute on the same object, sameness being given by the schema matching and duplicate detection steps performed before. A minimal example illustrating both kinds of data conflict follows this list.
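The sketch below assumes both source records have already been mapped to a shared, hypothetical ex: vocabulary and correlated to the same URI; in practice the origin of each triple would be kept, for example in named graphs, so that resolution functions can exploit provenance.

```turtle
@prefix ex:  <http://example.org/building/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Contradiction: the two sources disagree on the building's age.
ex:building42 ex:age "23"^^xsd:integer .        # asserted by source 1
ex:building42 ex:age "25"^^xsd:integer .        # asserted by source 2

# Uncertainty: only source 1 provides a height; source 2 leaves it unknown (null).
ex:building42 ex:height "12.5"^^xsd:decimal .
```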
8.2. Data Fusion Strategies and Answers
We can distinguish three categories of conflict strategies:
- Conflict-ignoring strategies do not make a decision as to what to do with conflicting data and sometimes are not even aware of data conflicts. An example of an ignoring strategy is Pass It On, which presents all values and thereby defers conflict resolution to the user.
- Conflict-avoiding strategies acknowledge the existence of possible conflicts in general, but do not detect and resolve single existing conflicts. Instead, they handle conflicting data by applying a unique decision equally to all data, such as preferring data from a special source with the Trust Your Friends strategy.
- Conflict resolution strategies do regard all the data and metadata before deciding on how to resolve a conflict. They can further be subdivided into deciding and mediating strategies, depending on whether they choose a value from all the already present values (deciding) or choose a value that does not necessarily exist among the conflicting values (mediating).
Bleiholder et al. (2005) [29] formalize some possible conflict handling strategies, summarized in Table 1.
| Strategy | Classification | Short Description |
|---|---|---|
| PASS IT ON | ignoring | escalates conflicts to user or application |
| CONSIDER ALL POSSIBILITIES | ignoring | creates all possible value combinations |
| TAKE THE INFORMATION | avoiding, instance based | prefers values over null values |
| NO GOSSIPING | avoiding, instance based | returns only consistent tuples |
| TRUST YOUR FRIENDS | avoiding, metadata based | takes the value of a preferred source |
| CRY WITH THE WOLVES | resolution, instance based, deciding | takes the most often occurring value |
| ROLL THE DICE | resolution, instance based, deciding | takes a random value |
| MEET IN THE MIDDLE | resolution, instance based, mediating | takes an average value |
| KEEP UP TO DATE | resolution, metadata based, deciding | takes the most recent value |
8.3. Data Fusion Answers
The fusion result returned by an integrated information system should have the following characteristics:
- Complete: A complete answer contains all the objects (extensionally complete) and also all attributes (intensionally complete) that have been present in the sources. A complete answer is not necessarily concise, as it may contain objects or attributes more than just once.
- Concise: An answer is concise if all real-world objects (extensionally concise) and all semantically equivalent attributes (intensionally concise) present are described only once.
- Consistent: A consistent answer contains all tuples from the sources that are consistent with respect to a specified set of integrity constraints (inclusion or functional dependencies) [30]. In this sense, such an answer is not necessarily complete, as all inconsistent object representations are left out of the result. However, given that one of the integrity constraints is a key constraint on some real-world identifier, a consistent answer is extensionally concise for all included object representations.
- Complete and Consistent: A complete and consistent answer combines the advantages of completeness and conciseness and consists of all real-world object descriptions.
8.4. Conflict Resolution Functions
The following table summarizes a number of conflict resolution functions found in the literature [29].
| Function | Description | Type | Domain | D or M |
|---|---|---|---|---|
| All | Returns all values. | S | A | D |
| Any | Returns an arbitrary (non-NULL) value. | S | A | D |
| First | Returns the first (non-NULL) value. Requires ordering of the values on input. | S | A | D |
| Last | Returns the last (non-NULL) value. Requires ordering of the values on input. | S | A | D |
| Random | Returns a random (non-NULL) value. The chosen value differs among calls on the same input. | S | A | D |
| Certain | If input values contain only one distinct (non-NULL) value, returns it. Otherwise returns NULL or empty output (depending on the underlying data model). | S | A | D |
| Best | Returns the value with the highest data quality value. The quality measure is application-specific. | SP | A | D |
| TopN | Returns n best values (see Best). n is a parameter. | S | A | D |
| Threshold | Returns values with data quality higher than a given threshold. The threshold is given as a parameter. | SP | ND | D |
| BestSource | Returns a value from the most preferred source. The preference of source may be explicit (given preferred order of sources) or based on an underlying data quality model. | SP | A | D |
| MaxSourceMetadata | Returns a value from the source with a maximal source metadata value. The metadata value may be, e.g., timestamp of the source, access cost or a data quality indicator. The used type of source metadata is either given as a parameter or fixed. | SP | A | D |
| MinSourceMetadata | Returns a value from the source with the minimal source metadata value (see MaxSourceMetadata). | SP | A | D |
| Latest | Returns the most recent (non-NULL) value. Recency may be available from another property, value/entity metadata or source metadata (the last case is a special case of MaxSourceMetadata). | SP | A | D |
| ChooseSource | Returns a value originating from the source given as a parameter. | SP | A | D |
| Vote | Returns the most-frequently occurring (non-NULL) value. Different strategies may be employed in case of tie, e.g., choosing the first or a random value. | S | A | D |
| WeightedVote | Same as Vote but each occurrence of a value is weighted by the quality of its source. | SP | A | D |
| Longest | Returns the longest (non-NULL) value. | S | SCT | D |
| Shortest | Returns the shortest (non-NULL) value. | S | SCT | D |
| Min | Returns the minimal (non-NULL) value according to an ordering of input values. | S | SND | M |
| Max | Returns the maximal (non-NULL) value according to an ordering of input values. | S | SND | M |
| Filter | Returns values within a given range. The minimum and/or maximum are given as parameters. | SP | SND | D |
| MostGeneral | Returns the most general value according to a taxonomy or ontology. | SP | T | D |
| MostSpecific | Returns the most specific value, according to a taxonomy or ontology (if the values are on a common path in the taxonomy). | SP | T | D |
| Concat | Returns a concatenation of all values. The separator of values may be given as a parameter. Annotations such as source identifiers may be added to the result. | S | A | M |
| Constant | Returns a constant value. The constant may be given as a parameter or be fixed (e.g. NULL). | SP | A | M |
| CommonBeginning | Returns the common substring at the beginning of conflicting values. | S | S | M |
| CommonEnding | Returns the common substring at the end of conflicting values. | S | S | M |
| TokenUnion | Tokenizes the conflicting values and returns the union of the tokens. | SP | S | M |
| TokenIntersection | Tokenizes the conflicting values and returns the intersection of the tokens. | SP | S | M |
| Avg | Returns the average of all (non-NULL) input values. | S | N | M |
| Median | Returns the median of all (non-NULL) input values. | S | N | M |
| Sum | Returns the sum of all (non-NULL) input values. | S | N | M |
| Count | Returns the number of distinct (non-NULL) values. | S | A | M |
| Variance, StdDev | Returns the variance or standard deviation of values, respectively. | S | N | M |
| ChooseCorresponding | Returns the value that belongs to an entity (resource) whose value has already been chosen for a property A, where A is given as a parameter. | MP | A | D |
| ChooseDepending | Returns the value that belongs to an entity (resource) which has a value v of a property A, where v and A are given as parameters. | MP | A | D |
| MostComplete | Returns the (non-NULL) value from the source having fewest NULLs for the respective property across all entities. | SP | A | D |
| MostDistinguishing | Returns the most distinguishing value among all present values for the respective property. | SP | A | D |
| Lookup | Returns a value by doing a lookup into the source given as a parameter, using the input values. | SP | A | M |
| MostActive | Returns the most often accessed or used value. | SP | A | D |
| GlobalVote | Returns the most-frequently occurring (non-NULL) value for the respective property among all entities in the data source. | S | A | D |
| Coalesce | Takes the first non-null value appearing. | S | A | D |
| Group | Returns a set of all conflicting values. Leaves resolution to the user. | S | A | x |
| Highest Quality | Evaluates to the value of highest information quality, requiring an underlying quality model. | SP | A | D |
| Most Recent | Takes the most recent value. Most recentness is evaluated with the help of another property or other data about recentness of tuples/values. | SMP | A | D |
| Most Active | Returns the most often accessed or used value. Usage statistics of the knowledge base can be used in evaluating this function. | S | A | D |
| Choose Corresponding | Chooses the value that belongs to the value chosen for another column. | S | A | D |
| Most complete | Returns the value from the source that contains the fewest null values in the attribute in question. | S | A | D |
| Most distinguishing | Returns the value that is the most distinguishing among all present values in that property. | S | A | D |
| Highest information value | According to an information measure this function returns the value with the highest information value. | S | A | D |
This list of functions could be used as a starting point to design an ontology for fusion that defines conflict resolution functions (possibly based on SHACL functions). Unfortunately, due to time constraints, this would have to be investigated in future Testbeds.
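As an illustration of what such an approach might look like, the following Turtle snippet sketches one resolution function (roughly the Avg / Meet in the Middle behavior) declared as a SHACL-SPARQL function as defined in the SHACL Advanced Features note. The fuse: namespace, the function name and its parameters are hypothetical and are not part of any Testbed-15 deliverable.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix fuse: <http://example.org/ont/fusion#> .   # hypothetical fusion vocabulary

# A mediating resolution function: returns the average of two conflicting values.
fuse:meetInTheMiddle
    a sh:SPARQLFunction ;
    sh:parameter [ sh:path fuse:value1 ; sh:datatype xsd:decimal ] ;
    sh:parameter [ sh:path fuse:value2 ; sh:datatype xsd:decimal ] ;
    sh:returnType xsd:decimal ;
    sh:select """
        SELECT ((($value1 + $value2) / 2) AS ?result)
        WHERE { }
        """ .
```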
9. Shapes Constraint Language (SHACL)
As SHACL is used in the ontologies defined for this Testbed to support a knowledge fusion pipeline, this section provides an overview of the W3C Shapes Constraint Language (SHACL) standard.
Work in previous Testbeds [31] utilized RDF and OWL to define sets of classes and properties that could be reused by a large number of external vocabularies. RDF Schema is a vocabulary for expressing classes and their properties, as well as associations of properties with classes. OWL is an extension of RDF Schema that expresses restrictions. However, declaring restrictions and constraints in the ontology itself and enforcing them cannot be fully achieved with these technologies. Enforcing such restrictions required the user or developer to read the documentation and implement the constraints in code.
The Shapes Constraint Language (SHACL) is a W3C standard vocabulary for describing and validating RDF graph structures. These graph structures are captured as "shapes", which apply to nodes in RDF graphs. Shapes identify predicates and their associated cardinalities and datatypes. SHACL shapes can be used to communicate data structures associated with a process or interface, to generate or validate data, or to drive user interfaces. The vocabulary allows well-defined and complex integrity constraints to be expressed in RDF, with additional constraints defined in SPARQL or JavaScript. SHACL is not a replacement for RDFS/OWL, but a complementary technology that is not only very expressive but also highly extensible.
Both OWL and SHACL rely on RDF Schema to define vocabulary terms (classes/properties) and their hierarchies (subclasses, sub-properties). The property constraints (cardinality, valid values, etc.) can be captured using SHACL. SHACL can accommodate multiple profiles by providing different shapes for the same ontology.
The SHACL vocabulary is not only defined in RDF itself, but the same macro mechanisms can be used by anyone to define new high-level language elements and publish them on the web. This means that SHACL will lead not only to the reuse of data schemas but also to domain-specific constraint languages. Further, SHACL can be used in conjunction with a variety of languages besides SPARQL, including JavaScript. Complex validation constraints can be expressed in JavaScript so that they can be evaluated client-side. In addition, SHACL can be used to generate validation reports for quality control, potentially with suggestions to fix validation errors. Overall, SHACL is a future-proof schema language designed for the Web of Data.
In summary, features of SHACL include:
- RDF vocabulary used to express shapes, targets and constraints.
- Constraints can be expressed in extension languages like SPARQL.
- SHACL shapes can be mixed with other semantic web data that is compatible with RDF and linked data principles.
- SHACL definitions can be serialized in multiple RDF formats.
9.1. Comparison of OWL and SHACL
There is a fundamental difference in the interpretation of OWL restrictions versus SHACL constraints. OWL is designed for inferencing, meaning that the interpretation of restrictions leads to OWL making assumptions and inferences about the data. OWL is based on an Open-World Assumption [4], where absent statements and properties can be filled in later, and the absence of data does not invalidate a statement; the application should infer the missing data. An apparent violation of, for example, a cardinality restriction means that the OWL processor will assume the violating values must represent the same entity under different URIs. This is because OWL makes a distinction between a URI and the real-world entity it denotes, so multiple URIs can represent the same entity; in other words, OWL does not make the Unique-Name Assumption. [4]
SHACL is designed based on a Closed-World Assumption [4], where lack of data invalidates the statement, meaning that the interpretation of restrictions is based on the assumption that the knowledge base is complete, or can be assumed to be as complete as possible with the current information. Aside from the built-in constraints, SHACL constraints can be expressed using SPARQL or JavaScript, making the language highly flexible and extensible. In contrast to OWL, where limited data validation is done via inferencing, SHACL separates checking data validity from reasoning and inferring new facts. OWL’s built-in Open World Assumption and lack of a Unique Name Assumption contradict established approaches from schema languages and make the meaning of certain statements (e.g., cardinality) different from what most modelers expect. [4]
Due to the differences in design philosophy and implementation, one of the main advantages of SHACL over OWL is that SHACL is extensible, while OWL is limited to the features defined in the OWL specification. In the end, the SHACL vocabulary is complementary to OWL, but has higher usability. The usability increase can be experienced through the way SHACL shapes define constraints and constraint targets, as well as through the built-in constraint types.
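The difference is easiest to see on a small example. The following Turtle snippet (with hypothetical ex: terms) expresses "a river has at most one mouth" first as an OWL restriction and then as a SHACL shape:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/hydro#> .

# OWL: a cardinality restriction interpreted under the open-world assumption.
# If the data states  ex:richelieu ex:hasMouth ex:nodeA , ex:nodeB .  a reasoner
# does not flag an error; it infers ex:nodeA owl:sameAs ex:nodeB instead.
ex:River rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:hasMouth ;
    owl:maxCardinality "1"^^xsd:nonNegativeInteger
] .

# SHACL: the same expectation as a closed-world constraint. Two ex:hasMouth
# values on an ex:River instance produce a validation violation.
ex:RiverShape a sh:NodeShape ;
    sh:targetClass ex:River ;
    sh:property [
        sh:path ex:hasMouth ;
        sh:maxCount 1
    ] .
```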
9.2. SHACL Shapes
When the SHACL standard was released by the W3C in 2017, it introduced the concept of shapes. SHACL shapes are themselves described as RDF graphs, referred to as shapes graphs, and follow a hierarchy of shapes based on the RDF Schema language. SHACL validates RDF graphs against the shapes, where the data being validated is called a data graph.
There are two types of shapes: node shapes, which declare constraints directly on a node, and property shapes, which declare constraints on the values associated with a node through a path. Node shapes declare constraints directly on a node, e.g., node kind (Internationalized Resource Identifier (IRI), literal or blank node), IRI regex, etc. Property shapes declare constraints on the values associated with a node through a path, e.g., constraints about a certain outgoing or incoming property of a focus node: cardinality, datatype, numeric min/max, etc. (see Figure 4).
SHACL shapes may define several target declarations. Target declarations specify the set of nodes that will be validated against a shape, e.g., directly pointing to a node, or all nodes that are subjects to a certain predicate, etc. The target declarations of a shape in a shapes graph are triples with the shape as the subject and one of sh:targetNode, sh:targetClass, sh:targetObjectsOf or sh:targetSubjectsOf as a predicate.
Shapes also declare constraints on the focus nodes and on the value nodes of their properties. Constraints applied to the target include cardinalities, ranges of values, data types, property pairs, etc. Shapes can also declare rules that can be used to add inferences or perform mapping from one model to another (see Figure 5).
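The following Turtle snippet is a small, hypothetical example of a node shape with a class target and two property shapes; the ex: vocabulary is illustrative only:

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/hydro#> .

ex:WaterBodyShape
    a sh:NodeShape ;
    sh:targetClass ex:WaterBody ;           # target declaration: all ex:WaterBody nodes
    sh:nodeKind sh:IRI ;                    # node constraint: focus nodes must be IRIs
    sh:property [                           # property shape on the rdfs:label path
        sh:path rdfs:label ;
        sh:datatype xsd:string ;
        sh:minCount 1                       # at least one label is required
    ] ;
    sh:property [                           # property shape on a hypothetical length path
        sh:path ex:lengthKm ;
        sh:datatype xsd:decimal ;
        sh:maxCount 1 ;
        sh:minInclusive 0                   # numeric range constraint
    ] .
```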
A more detailed description of the SHACL model is shown in Figure 6. The SHACL specification has a number of extensions that expand the expressiveness of the validation. The SHACL-SPARQL extension provides a mechanism to add Constraint Component plugins and Rules based on standard SPARQL. The SHACL Advanced Features specification provides additional mechanisms to add rules for inferencing and transformation. SHACL-JS provides JavaScript extensions for SHACL, which can implement functions for constraint component validators, target selections and rules.
RDF terms produced by targets are not required to exist as nodes in the data graph. Targets of a shape are ignored whenever a focus node is provided directly as input to the validation process for that shape. A focus node is an RDF term that is validated against a shape using the triples from a data graph.
An RDF graph uses namespaces to anchor to the appropriate (ontology) vocabularies, which can be stored in industry standard formats like Turtle, JSON-LD and RDF/XML. SHACL itself is defined as part of a namespace as well. There are a number of standard namespace prefixes that can be encountered as part of the shapes definition:
| Prefix | Namespace |
|---|---|
| rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs: | http://www.w3.org/2000/01/rdf-schema# |
| sh: | http://www.w3.org/ns/shacl# |
| xsd: | http://www.w3.org/2001/XMLSchema# |
While SHACL is defined using OWL, no UML diagrams were available for the specification. To support the implementation of the SHACL engine, a UML object model compliant with the SHACL ontology was defined as part of this work. The model may not capture all the nuances of the specification; however, it was designed to establish a shared, common understanding. The model should not be considered normative, but only informative. The model provides extension points for Functions, Rules, Constraint Components and TargetTypes. The built-in and most commonly used constraint components can be directly set on the shapes to simplify their encodings. An overview of the model is shown in Figure 7.
The next subsections describe the other first-class business objects of the SHACL specification in more detail.
9.3. Constraint Components
Constraint Components are at the core of the validation process. SHACL defines the concept of constraint components which are associated with shapes to declare constraints. Each node or property shape can be associated with several constraint components.
Constraint components are identified by an IRI and have two types of parameters: mandatory and optional. The association between a shape and a constraint component is made by declaring values for the parameters (called parameter bindings). The parameters are also identified by IRIs and have values. Most of the constraint components in SHACL Core have a single parameter and follow the convention that if the parameter is named sh:p, the corresponding constraint component is named sh:pConstraintComponent. The Constraint Component model is shown in Figure 8.
Constraint components are associated with validators (see Figure 9), which define the behavior of the constraint. Writing a custom constraint component is considered an advanced feature of the system. However, this provides a powerful extension mechanism to represent more advanced constraints such as spatial-temporal constraints.
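As a sketch of this extension mechanism, the following Turtle defines a hypothetical constraint component with one parameter, a SPARQL ASK validator, and a shape that binds the parameter. The ex: names follow the sh:p / sh:pConstraintComponent naming convention mentioned above but are illustrative only and are not part of SHACL or of any Testbed deliverable.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .

# Custom constraint component: values must start with a given prefix string.
ex:StartsWithConstraintComponent
    a sh:ConstraintComponent ;
    sh:parameter [
        sh:path ex:startsWith ;             # the mandatory parameter ex:startsWith
        sh:datatype xsd:string
    ] ;
    sh:validator ex:startsWithValidator .

# SPARQL ASK validator: returns true when the value node conforms.
ex:startsWithValidator
    a sh:SPARQLAskValidator ;
    sh:message "Value does not start with {$startsWith}" ;
    sh:ask """
        ASK { FILTER (isLiteral($value) && STRSTARTS(STR($value), $startsWith)) }
        """ .

# A shape binding the parameter: ex:featureCode values must start with "HYD".
ex:HydroFeatureShape
    a sh:NodeShape ;
    sh:targetClass ex:HydroFeature ;
    sh:property [
        sh:path ex:featureCode ;
        ex:startsWith "HYD"
    ] .
```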
For example, the DASH namespace includes a collection of SHACL constraint components that extend the SHACL Core with new constraint types, including value type, string-based, property pair, and relationship constraint components.