Publication Date: 2020-10-22

Approval Date: 2020-09-23

Submission Date: 2020-08-24

Reference number of this document: OGC 20-067

Reference URL for this document: http://www.opengis.net/doc/PER/SELFIE-ER

Category: OGC Public Engineering Report

Editor: David Blodgett

Title: Second Environmental Linked Features Experiment:


OGC Public Engineering Report

COPYRIGHT

Copyright © 2020 Open Geospatial Consortium. To obtain additional rights of use, visit http://www.opengeospatial.org/

WARNING

This document is not an OGC Standard. This document is an OGC Public Engineering Report created as a deliverable in an OGC Interoperability Initiative and is not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, any OGC Public Engineering Report should not be referenced as required or mandatory technology in procurements. However, the discussions in this document could very well lead to the definition of an OGC Standard.

LICENSE AGREEMENT

Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.

If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.

THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD. THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.

This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.

Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications.

This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.

None of the Intellectual Property or underlying information or technology may be downloaded or otherwise exported or reexported in violation of U.S. export laws and regulations. In addition, you are responsible for complying with any local laws in your jurisdiction which may impact your right to import, export or use the Intellectual Property, and you represent that you have complied with any regulations or registration procedures required by applicable law to make this license enforceable.

1. Subject

This report documents the Second Environmental Linked Features Interoperability Experiment (SELFIE). SELFIE evaluated a proposed Web resource model and HTTP behavior for linked data about and among environmental features. The outcomes are building blocks to establish a system of real-world feature identifiers and landing pages that document them. OGC API - Features was found to be a useful component for systems implementing both landing content and representations of linked-features. More work is needed to establish best practices related to negotiation between varied representations of a feature, observations related to a feature, and for expressing and mediating between varied content from a given resource. These technical / meta-model details were found to be difficult to evaluate given the small number of example implementations and limited number of domain-feature models available for use with linked data.

2. Executive Summary

At the outset of the SELFIE project, the team stated:

SELFIE aims to answer the question, what is the Web architecture that will allow us to use linked data for environmental features and observations in a way that is easily adoptable and compatible with World Wide Web Consortium (W3C) best practices and leverages OGC standards? The experiment aims for focused simplicity, representing resources built from potentially complex data for easy use on the Web. While the IE was focused on testing a specific resource model and followed W3C best practices and OGC standards, a wide range of participant-provided domain use cases will be used for testing. Ultimately, this work is intended to satisfy the needs of many use cases and many kinds of features, from disaster response and resilience to environmental health and the built environment.

The business case for the SELFIE can be illustrated considering two use cases:

  1. indexing and discovering models and research from public sector, private sector, or academic projects about a particular place or environmental feature.

  2. building a federated multi-organization monitoring network in which all member-systems reference common monitored features and are discoverable through a community index.

These use cases imply needs along several dimensions:

  1. a shared reference network of environmental features,

  2. the ability to use the reference network to index and provide access to information resources from many organizations,

  3. support for multiple disciplines' information models, conceptual models, research topics, and monitoring practices.

While the IE did not come to a conclusion on all these fronts, it did show that the core Web architecture to support identification of real-world features and retrieval of information about them exists and should be pursued in earnest. The architecture has three basic components, referred to here as URI-14, URL-14, and URL-200 resources.

  1. A URI-14 resource is one that has an identifier and is itself a real-world entity.

  2. A URL-14 resource is one that is the target of a redirect from a URI-14 and provides information about a URI-14 resource.

  3. A URL-200 resource is any other resource that would be linked to by a URL-14’s content.

These three resource types can be hosted in a wide variety of organizational architectures and/or governance schemes. No one right or wrong solution was found on this front, and the technical solutions explored proved flexible and capable of adapting to many architectural patterns.

These resources were applied in the context of five functionalities:

  1. Publication of identified non-information resources

  2. Describing a network of linked features

  3. Providing landing content about non-information resources

  4. Providing structured-data to support search indexing

  5. Providing links to representations and related data

These functionalities were seen to be satisfied by four technical use cases that are loosely aligned with the functionalities:

  1. Real-world feature identification

  2. Landing content and links to other features

  3. Structured data for search indexing

  4. Links to representations and other data

The details of the URL-14 resource’s content were the main subject of debate in the IE. Some important outcomes include:

  1. A URL-14’s HTTP URL should almost never be the subject or object of a linked data triple. It is a convenience resource about the URI-14. The URI-14 should be referenced rather than the URL-14.

  2. The content of a URL-14 should, at the top level, be a set of statements about a single URI-14. While nested or complex information about the URI-14 could be included, the document should be centered on one real-world feature.

  3. Spatial topology, monitoring relationships, and domain-specific associations between real world features should be expressed as relationships between URI-14 identifiers.

  4. Associations between URI-14 resources and representations of the feature should be expressed with a https://schema.org/subjectOf relationship. Additional nuances of URI-14 to URL-200 resources should be the subject of future work.

  5. URL-200 resources with a semantic representation (JSON-LD) can be the object of a https://schema.org/subjectOf relation. URL-200 resources that do not have a semantic representation should be represented as a "blank node" with a https://schema.org/url association to the URL-200 resource.
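The outcomes listed above can be illustrated with a minimal sketch of landing data for one feature. It is not taken from any SELFIE implementation: the `example.org` identifiers, the function name `landing_data`, and the choice of a prefix-style `@context` are all assumptions made for illustration. In-band URL-200 resources are referenced by `@id`, while out-of-band ones become blank nodes (node objects with no `@id`) carrying a `schema:url`.

```python
import json


def landing_data(feature_uri, name, in_band, out_of_band):
    """Build top-level statements about a single URI-14 as JSON-LD.

    feature_uri -- the URI-14 (never the URL-14) used as the subject
    in_band     -- URL-200s that have a semantic (JSON-LD) representation
    out_of_band -- URL-200s without a semantic representation
    """
    # In-band resources can be the direct object of schema:subjectOf.
    subject_of = [{"@id": url} for url in in_band]
    # Out-of-band resources are blank nodes (no "@id") with a schema:url.
    subject_of += [{"schema:url": url} for url in out_of_band]
    return {
        "@context": {"schema": "https://schema.org/"},
        "@id": feature_uri,
        "schema:name": name,
        "schema:subjectOf": subject_of,
    }


if __name__ == "__main__":
    # All identifiers below are hypothetical.
    doc = landing_data(
        "https://example.org/id/river/42",
        "Example River",
        in_band=["https://example.org/data/river/42.jsonld"],
        out_of_band=["https://example.org/files/river-42.pdf"],
    )
    print(json.dumps(doc, indent=2))
```

The whole document is centered on one subject, the URI-14, which matches the second outcome above.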

OGC API - Features was found to be compatible with all of the above and can be used as a core enabling Web API as networks of linked environmental features are established.

At the outset of SELFIE, the team hoped to experiment with use cases related to variation of available content and multiple data providers for URL-200 resources about a single URI-14 resource. Gaining an appreciation for the nuances of the functionalities and technical use cases required in the context of the broadly varied organizational architectures considered was a large task. Further, some basic characteristics of URL-14 resources and the content to include in their landing content needed to be established before further investigation could continue. Given that, future work should investigate issues such as variation of content for a single URL-200 resource, multiple URL-200 representations of the same feature with variation of content across the providers, and content negotiation of URL-14 resources to either directly access URL-200 resources or access differing profiles of URL-14 landing content.

The IE also aimed to make progress on technical solutions with observational data models and domain feature models. This work was largely deferred for the same reasons as discussed above and because publication of domain feature models and domain features themselves is a pre-requisite to meaningfully testing how to work with them in the context of observations data models. The technical baseline provided by the first and second ELFIE now sets the stage for this work to move forward.

2.1. Document contributor contact points

All questions regarding this document should be directed to the editor or the contributors:

Contacts

Name | Organization | Role
David Blodgett | U.S. Geological Survey | Editor
Alistair Ritchie | Manaaki Whenua | Contributor
Bruce Simons | Federation University Australia | Contributor
Eric Boisvert | Natural Resources Canada | Contributor
Abdelfettah Feliachi | BRGM - INSIDE environmental information systems research center | Contributor
Sylvain Grellet | BRGM - INSIDE environmental information systems research center | Contributor

2.2. Foreword

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.

Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.

3. References

4. Terms and definitions

● Resource

An item of interest in the distributed network of environmental data.

● Non-Information Resource

A real-world or conceptual object of interest that is identified by a Uniform Resource Identifier bound to the HTTP protocol (HTTP URI).

Note
In the context of this document, Non-Information Resources are strictly identified by HTTP URIs. In other contexts, a Non-Information Resource may be identified using other protocols such as Uniform Resource Names.

● Information Resource

A digital resource that can be sent as a message over the internet using a protocol such as HTTP. Located using a Uniform Resource Locator (URL).

● Information Index Resource

An information resource that provides an index of annotated (metadata) links to information and non-information resources that describe or are related to the non-information resource of interest.

● Indirect Identifier

As defined in the W3C Architecture of the World Wide Web, Volume 1 Section 2.2.3. In the context of SELFIE, a URI that would usually be used to identify a digital resource is sometimes used as an indirect identifier of a real-world feature.

● In-band resource

An in-band resource is one that can provide information according to a given technical architecture. In the context of linked data, an in-band resource can provide Hypertext Markup Language (HTML) and Resource Description Framework (RDF) content serialized as JavaScript Object Notation for Linked Data (JSON-LD) representations. Other representations may also be considered in-band if the specified architecture expects them (GeoJSON for example). In-band resources can extend the linked-data graph.

● Out-of-band resource

An out-of-band resource is one that does not adhere to the technical architecture from which it is found. A resource that can provide observation result graphs, XML, CSV, PNG, PDF and JSON representations but no linked-data representation would be considered out-of-band if linked from a linked-data document. Out-of-band resources cannot extend the linked data graph.

● HyperText Transfer Protocol Uniform Resource Identifier (HTTP URI)

An identifier with the potential to be used with the HTTP protocol to dereference (look up) the identified resource.

● HyperText Transfer Protocol Uniform Resource Locator (HTTP URL)

A type of URI that can be used to locate an information resource.

● Data Resource

An information resource providing a representation of a non-information resource.

● Registry

Per ISO 19135, Geographic information, Procedures for item registration: An information system that manages a set of files containing identifiers assigned to items with descriptions of the associated items.

● Resource Model

A taxonomy and functional description of the system of non-information, index and data resources.

● Node

A source of information about real world features. May be specific to a geospatial or scientific application domain. Acts as a node in a system of linked data providers.

● Hub

An aggregator or indexer of information about real world features. Provides integrated information as landing content derived from a community of nodes.

● Provider

An originating source of data.

● Resolver

A registry system that provides 303 redirection from URI-14s to URL-14s.

Note
In general, outside the scope of this document and depending on the intent of a given system, a resolver can be more than a 303 registry system. Broadly, it could be any intelligent function that adapts the response of a server to the context of a request.

● Landing resource

The ‘default’ information resource provided, through a range-14 303 redirect, when a non-information resource’s URI is dereferenced. The landing resource is abstract; the actual information resource returned depends on content negotiation (HTML → landing page; JSON-LD → landing data). Assumes constant content in the concrete landing page and data.

● Landing page

Presentation-oriented HTML representation of the landing resource. Resource description data are included as values of HTML tags and as structured data: JSON-LD in an HTML script tag.

● Landing data

Machine-oriented representation of the landing resource. It is the structured data object from the HTML page, presented on its own. SELFIE expects the landing data media type to be JSON-LD, but others are allowed, even encouraged (RDF/XML, TTL, GML, GeoJSON, etc.).

● Structured data

As per Google: https://developers.google.com/search/docs/guides/intro-structured-data

● URI-14

The HTTP URI identifying a non-information resource. When dereferenced the host will respond with 303 redirect to a URL for an information resource. Content can be negotiated.

● URL-14

A URL, provided by the 303 redirect from a non-information resource’s HTTP URI, that locates an information resource. Ideally these are kept hidden (not provided as values in data) as they shouldn’t be confused with the non-information resource’s HTTP URI.

● URL-200

A common or garden URL. So called because the most likely HTTP response code is a ‘200 OK’ with content. It could be the URL for a service request or a file on a file server. Content can be negotiated.
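The relationship between a resolver, a URI-14, and a URL-14 defined above can be sketched as a lookup that answers with a 303 redirect. This is a minimal illustration under stated assumptions, not a description of any participant's system: the `REGISTRY` contents, the `example.org` URLs, and the function name `resolve` are hypothetical, and a real resolver would sit behind an HTTP server.

```python
# Hypothetical registry: path of a URI-14 -> URL-14 of its landing resource.
REGISTRY = {
    "/id/river/42": "https://example.org/info/river/42",
}


def resolve(path, registry=REGISTRY):
    """Return (status_code, headers) for a request to a URI-14 path."""
    target = registry.get(path)
    if target is None:
        # Unknown identifier: nothing to redirect to.
        return 404, {}
    # 303 See Other: the URI-14 identifies a real-world feature, so the
    # response carries no representation, only the location of the URL-14.
    return 303, {"Location": target}


if __name__ == "__main__":
    print(resolve("/id/river/42"))
    print(resolve("/id/unknown"))
```

Keeping the mapping in a registry is what lets the URL-14 change without changing the identifier that other data refer to.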

4.1. Abbreviated terms

  • API - Application Programming Interface

  • CSV - Comma Separated Values

  • CURI - Compact Uniform Resource Identifier

  • ELFIE - Environmental Linked Features Interoperability Experiment

  • GeoJSON - Geographic JavaScript Object Notation

  • GML - Geography Markup Language

  • GWML2 - Groundwater Markup Language 2

  • HTML - HyperText Markup Language

  • HTTP URI - HyperText Transfer Protocol Uniform Resource Identifier

  • HY_Features - Surface Hydrologic Features Conceptual Model

  • IE - Interoperability Experiment

  • JSON - JavaScript Object Notation

  • JSON-LD - JavaScript Object Notation for Linked Data

  • OWL - Web Ontology Language

  • RDF - Resource Description Framework

  • SELFIE - Second Environmental Linked Features Interoperability Experiment

  • TTL - Terse RDF Triple Language

  • URI - Uniform Resource Identifier

  • URL - Uniform Resource Locator

  • XML - eXtensible Markup Language

5. Overview

Objectives provides a high-level overview of how the first ELFIE’s outcomes provide context for the objectives of the SELFIE.

Domain Use Cases describes how domain use cases and more general technical use cases were used in SELFIE.

Resource / Content Model is the core discussion of the SELFIE experiment. It includes four subsections:

Detailed descriptions of the technical outcomes of the experiment are provided in:

The report concludes with Summary and outcomes and Issues and recommendations, which summarize the results and describe issues for future work, respectively.

Finally, Domain Use Cases provides summaries of selected domain use cases contributed by SELFIE participants.

5.1. Objectives

The first Environmental Linked Features Interoperability Experiment (ELFIE) sought to answer the question, "what linked data content should be included in a landing page describing an environmental feature and its relationship to other features and data?". Limiting scope in this way allowed the team to avoid the complex issues related to network behavior and the semantics of requesting default or alternate representations of a feature, representations of it, or other features and data in some way related to it. These issues were discussed in the first ELFIE — but often only briefly or in the context of defining what was expressly out of scope for the project [1]. The Second Environmental Linked Features Interoperability Experiment (SELFIE) took these issues on in earnest.

Objectives of SELFIE, from the project charter were:

  1. Evaluate a proposed resource model for multi-provider environmental feature and observation registries.

  2. Evaluate proposed HTTP behavior for non-information resources and their representations.

  3. Design and evaluate linked data feature information index resources with media-type, language, and profile content negotiation as an extension of the building blocks provided by OGC API – Features (formerly called WFS3).

Within the context of these objectives, the functional and operational goals of the first ELFIE were upheld:

  1. linked-data content for describing and linking features and associated data and

  2. maintaining the rigor of OGC and W3C standards and best practices while providing easily-adopted approaches.

These can be summarized with the question, "what is the expected network behavior and resource model when resolving a Web identifier for a non-information resource?". While fairly simple on its face, this question proved to be challenging on a number of levels.

Objectives added during the IE:

  1. At a high level, we found that the architectural resource model that seemed to fit our understanding of the problem — a three-tiered resource model of Non-information resources, Meta-resources, and Data-resources — broke down when implemented in web-resources. The distinction between metadata and data is ultimately defined by the use of information and not the information itself.

  2. Semantic web technology and the rigor required for systems that support reasoning over a graph presented great opportunity and potential while introducing a level of complexity and technical specificity that was challenging to navigate as a group. The diverse backgrounds and levels of understanding of technologies made communication break-downs all too frequent.

  3. There are very few example systems that have implemented solutions to the problem pursued in SELFIE — Web-friendly landing pages for spatial features and related data. Where systems do approach the problem, they have used wide ranging technical and architectural approaches that proved difficult to compare and harmonize. The general lack of common language, standard web-resource models, or common implementation patterns meant the team often felt they were forging their own path through a thicket.

Due to these challenges, many issues discussed in SELFIE were tabled for later once more example implementations have had a chance to experiment and understand what works and why we might choose some approaches over others.

5.2. Domain Use Cases

The SELFIE relied on participants’ domain-specific use cases to provide context and drive decision making in the context of the IE. The use cases included hydrogeology, soils, hydrology, and land-survey information. Common to these use cases was the need to work with identifiers for environmental features for which multiple representations are available. Each of the use cases was implemented to one extent or another. Full details of the use cases are included in Domain Use Cases. Taken together and harmonized, these domain-specific use cases provided sufficient scope to determine a useful set of general use cases that are summarized in the following sections.

5.3. General SELFIE Use Cases

General, as opposed to domain-specific, SELFIE use cases are described in the section that follows. To provide greater insight into their purpose, the organizational architecture and functionalities they entail are first described in some detail. The use cases aim to maintain technical rigor while being practical and approachable. Rather than treating this as a balance or tension, the team used ease of implementation as a filter on technically rigorous solutions, leaving complexity out where the team was not ready to recommend an easy-to-implement approach. While technical, and in some cases very specific, these use cases do not imply complete technical approaches.

5.3.1. Organizational architectures

The SELFIE included applications with a variety of organizational architectures. This diversity resulted from the social, political, and technical setting the applications were situated in. Aspects that were potentially diverse included:

  • Single to multiple non-information identifier (URI-14) registry and redirect systems.

  • Single to multiple interlinked providers of landing content (URL-14).

  • Single to multiple providers of feature representations and other data (URL-200).

This diversity required some careful consideration and handling, and the solutions explored in SELFIE proved to hold up well across the range of organizational architectures encountered.

5.3.2. SELFIE Functionalities

The linked data architecture that resulted from SELFIE is based on the five functionalities that are described below. These are described as functional use cases that loosely align with the general use cases. These functions were common across practically all the examples considered by the experiment and are presented here as a general set of functions for linked environmental features and related data.

Publication of identified non-information resources is a prerequisite for establishing links between features and related data. Persistence and long-term uniqueness of URIs used to identify non-information resources is helpful but cannot be guaranteed. A robust system of linked data must be able to deal with changes to identifiers through re-indexing or similarity relationships. Similarly, use of common identifiers across organizations is helpful but cannot be guaranteed. Systems of linked data must be able to handle when organizations use different identifiers to refer to the same real-world feature.
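One way to picture the handling of differing identifiers is to group identifiers that have been asserted to refer to the same real-world feature. The sketch below assumes such equivalence assertions are already available as pairs of URIs; how they are expressed (for example, with a similarity or "same as" relationship) is not specified here. The function name `group_equivalent` and the `example.org`/`example.net` identifiers are hypothetical.

```python
def group_equivalent(pairs):
    """Group identifiers connected by equivalence assertions (union-find)."""
    parent = {}

    def find(x):
        # Register unseen identifiers as their own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for identifier in list(parent):
        groups.setdefault(find(identifier), set()).add(identifier)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical identifiers minted by two organizations.
    assertions = [
        ("https://example.org/id/river/42", "https://example.net/feature/r-0042"),
    ]
    print(group_equivalent(assertions))
```

An index built this way can be rebuilt when identifiers change, which corresponds to the re-indexing option mentioned above.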

A network of linked features is formed when considering topological and domain-specific linked data associations between identified features. From an indexing perspective, this network can be "crawled" and indexed by both domain-specific and general web search crawlers. While a rich graph of linked features can be resolved or may exist within a linked data system, the functionality required here is exposure of direct links from one feature to adjacent neighbors such that the linked-feature network can be traversed by a human user or Web crawler.
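The traversal described above, following direct links from one feature to its adjacent neighbors, can be sketched as a breadth-first crawl. The in-memory `LINKS` table stands in for dereferencing each feature's landing content and reading its links; that table, the `example.org` identifiers, and the function name `crawl` are assumptions for illustration only.

```python
from collections import deque

# Hypothetical adjacency: feature URI -> URIs of directly linked features.
LINKS = {
    "https://example.org/id/catchment/1": ["https://example.org/id/river/42"],
    "https://example.org/id/river/42": [
        "https://example.org/id/catchment/1",
        "https://example.org/id/gauge/7",
    ],
    "https://example.org/id/gauge/7": [],
}


def crawl(start, neighbors):
    """Visit every feature reachable from start, nearest first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        uri = queue.popleft()
        order.append(uri)
        for linked in neighbors(uri):
            # Skip features already queued so that cycles terminate.
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


if __name__ == "__main__":
    for uri in crawl("https://example.org/id/catchment/1",
                     lambda u: LINKS.get(u, [])):
        print(uri)
```

Only links to adjacent neighbors are needed at each step; the crawler assembles the wider graph itself.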

Landing content is common metadata about a feature and data associated with it. In addition to this common-core metadata, landing content might also include:

  • A multi-organization index of information about the feature

  • Links to multiple or alternative representations of the feature

  • Pre-fetched information (e.g. labels and media-types) about resources.

Structured-data to support search indexing is the representation of landing content that is presented to a web-crawler. The lexicon of this most-default representation must be common to the Web (e.g. schema.org) and the breadth of content focused such that only specific pertinent details for general search, discovery, and general preview (such as a knowledge panel) are included.

Providing links to representations and related data is the ultimate purpose of the system of linked data explored in SELFIE. Such resources are generally not natively defined in linked-data formats and cannot be incorporated into the linked data graph directly. As described in detail later, such out-of-band resources must be referred to with associations like schema.org/url rather than as in-band linked data resources.

5.3.3. General Use Case Descriptions

The content model is best described with the use cases described in the following paragraphs:

  1. real-world feature identification

  2. landing pages and other default content

  3. structured data for search indexing

  4. links to representations and other data

The feature identification use case involves association of an HTTP URI with a recognized real-world feature. In the most sophisticated implementations, this would be a "URI-14" URI which only ever returns an HTTP 303 See Other response directing a client to a "URL-14" which would return landing content. However, a less sophisticated implementation may conflate the URI-14 and URL-14 resources such that the feature identification use case is satisfied with a URL that returns landing content. This was found to be a valid and practical approach. It should nonetheless be an exception to the norm: conflating the identifiers for a non-information resource and an information resource (URI-14 and URL-14) makes it ambiguous whether a reference is to the actual real-world entity or to the digital resource.

The landing content and network of linked features use case focuses on the default content and encoding that a search-engine crawler expects. It involves the HTML media-type content returned by default when resolving a feature identification resource whether via 303-redirect or not. Structured data in landing content must be designed in the lexicon of the web, focused on schema.org and other common ontologies and encoded in JSON-LD. HTML content provides useful natural language descriptors and uses appropriate link relations wherever possible. The URL that is used to retrieve landing content could have a number of sophisticated alternative behaviors accessed via HTTP content negotiation and/or appended API patterns, but the default response when the Accept header indicates HTML would typically be designed to satisfy the needs of the landing content use case.
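A landing page of the kind described above, HTML with JSON-LD structured data embedded in a script tag, can be sketched as follows. The page layout, the function name `render_landing_page`, and the sample identifier are assumptions; the only point illustrated is that the same structured data object is carried inside the HTML representation.

```python
import html
import json


def render_landing_page(landing_data):
    """Render an HTML landing page with embedded JSON-LD structured data."""
    name = html.escape(str(landing_data.get("schema:name", "")))
    # Escape "<" so the JSON text cannot close the script element early;
    # "\u003c" is an ordinary JSON string escape and parses back to "<".
    structured = json.dumps(landing_data, indent=2).replace("<", "\\u003c")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"<title>{name}</title>\n"
        '<script type="application/ld+json">\n'
        f"{structured}\n"
        "</script>\n</head>\n"
        f"<body><h1>{name}</h1></body>\n</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical landing data for one feature.
    print(render_landing_page({
        "@context": {"schema": "https://schema.org/"},
        "@id": "https://example.org/id/river/42",
        "schema:name": "Example River",
    }))
```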

Structured-data for search indexing, what we might call in-band resources, could involve various lexicons and graph-views of linked data that adhere to the RDF data model and are part of a consistent structured data web architecture. Logically, such content should be returned from the URL used to retrieve landing content if the HTTP Accept header indicates a linked data type such as JSON-LD, HTML, or another hypermedia or media type included in a defined architecture. With regard to linked data types, API patterns such as those introduced in the Linked Data API or HTTP content negotiation by profile may be relevant here. A system may return all known associations to the identified feature, or a custom view of an extended linked data graph that meets the needs of an implementation. Given that multiple profiles of linked data may be available, the linked data rendered in the <script> tag of the HTML representation of a landing resource may provide different content than other linked-data representations. This follows from the fact that many linked data use cases don’t necessarily focus on search-engine indexing.
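Media-type selection from the Accept header, as discussed above, can be sketched with a small negotiation function. This is a simplified illustration: it honors q-values and wildcards but ignores media-type parameters and profile negotiation, and the `OFFERED` list and function names are assumptions rather than part of any SELFIE implementation.

```python
# Representations a hypothetical landing resource can provide; the first
# entry is the default when the client expresses no preference.
OFFERED = ["text/html", "application/ld+json"]


def parse_accept(header):
    """Return [(media_type, q)] ordered by descending q-value."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header, offered=OFFERED):
    """Pick the offered media type the client prefers, or None."""
    if not accept_header.strip():
        return offered[0]
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in offered:
            return media_type
        if media_type == "*/*":
            return offered[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in offered:
                if candidate.startswith(prefix):
                    return candidate
    return None


if __name__ == "__main__":
    print(negotiate("text/html,application/xhtml+xml,*/*;q=0.8"))
    print(negotiate("application/ld+json"))
```

A browser-style header selects the HTML landing page, while a client asking for JSON-LD receives the landing data from the same URL.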

Data that represent or are otherwise related to features are what we refer to here as out-of-band. Such data are not part of the system of linked data and related content. These might be a complex GML representation of a feature, an image or map, a report, or a JSON representation of a timeseries. The distinction is that a given representation of a resource is either compatible with a technical architecture (i.e. can be parsed and handled by software that works with it) or is not (i.e. is opaque to software that works with the architecture). In-band content can directly extend the linked-data graph and out-of-band cannot.

Figure 1 provides a summary of the SELFIE functionalities in the context of the range of potential implementation sophistication.

SELFIE fig1
Figure 1. The four functions of the SELFIE general use cases. The simplest implementations, while limited, may use a single resource and content negotiation for all four functions. A complete SELFIE implementation would use separate resources for each function with linked-data hypermedia to facilitate discovery and access.

6. Resource / Content Model

Resources are the stuff of the internet. Is a resource 1) its identifier, 2) content retrieved by dereferencing its identifier, or 3) some abstract notion identified by a URI and described by dereferenced content? The problems pursued by SELFIE made the importance of these distinctions clear and, to some extent, produced some answers. More often than finding answers, SELFIE found that the technical baseline pertaining to the problem is rich and relatively unexplored for environmental data use cases. That is, given the technical baseline available to the community, implementation of modern technologies should proceed in order to better understand pain points and to find where additional complexity really is required versus where existing technologies can satisfy the real need.

6.1. W3C Resources Summary

The SELFIE model for web resources is based on the notion of information resources and non-information resources as defined by the W3C document ‘Dereferencing HTTP URIs’. This summarizes the so-called ‘httpRange-14’ decision (referred to in this report as Range-14) that, while unofficial, is useful, especially when read in conjunction with ‘Cool URIs for the Semantic Web’.

Information resources are the currency of the web - the pages and data served in a digital form to be consumed by web browsers and applications. Non-information resources are the things these information resources may describe (for example people, mountains, or Aristotelian philosophical constructs). For SELFIE, the critical distinction between them is that an information resource has a location (expressed as a URL) while a non-information resource has identity, expressed as an HTTP URI.

HTTP URIs have two roles:

  1. a globally unique identifier that can be presented as an identifier string and matched with other identifier strings to establish sameness; and

  2. the location of a resolver that can redirect enquiries to the location of an appropriate information resource.

Several representative information resources describing given non-information resources may exist. For example, information resources available according to various media-types, ontologies and/or content models (e.g. data quality rules, controlled vocabularies or units of measure) may be available. SELFIE sought to refine the classification of these representative information resources to help data-providers clarify what type of resource they publish providing guidance for implementation of resolvers, catalogs, and data services.

The Environmental Linked Features in SELFIE are non-information resources. These can be thought of in the following groups (in the following, all 'features' are 'non-information resources'):

  1. Domain features - identifiable environmental things in the world.

  2. Sampling features - human-created representative samples of domain features. These features exist to provide metadata about how the feature was described and how robust/representative that description is. An information resource describing a domain feature would logically use, link to, or summarize data related to sampling features.

  3. Semantic resources: the ontologies and vocabularies used to structure and populate the descriptions of domain and sampling features. These are formalized using OWL and RDF.

Each of these groups occupies a different meta-level, but they can be interlinked. For example, an agent traversing a knowledge graph can move from the description of a domain feature to how that description was obtained (sampling features). The links between domain and sampling features are described with ontologies such as SOSA. Given that SELFIE has adopted JSON-LD as its RDF encoding syntax, links to semantic resources are provided via JSON-LD contexts that map JSON keys and types to RDF ontology property and class HTTP URIs.
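To make the role of a JSON-LD context concrete, the following Python sketch expands short JSON keys into full property URIs using a flat context. It is deliberately simplified and is not a JSON-LD processor: nested contexts, @vocab, and type coercion are ignored, and the context, the feature URIs, and the function names are hypothetical. A real implementation would use a conformant JSON-LD library.

```python
# Hypothetical flat context: prefixes plus short keys mapped to compact IRIs.
CONTEXT = {
    "schema": "http://schema.org/",
    "sosa": "http://www.w3.org/ns/sosa/",
    "name": "schema:name",
    "isSampleOf": "sosa:isSampleOf",
}

# Hypothetical sampling feature linked to the domain feature it samples.
SAMPLE = {
    "@id": "https://example.org/id/sample/1",
    "name": "Example stream gage",
    "isSampleOf": {"@id": "https://example.org/id/river/1"},
}


def expand_term(term, context):
    """Resolve a JSON key to a full IRI using a flat context (simplified)."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @id are left untouched
    value = context.get(term, term)
    prefix, sep, local = value.partition(":")
    # Expand "prefix:local" only when the prefix is defined in the context;
    # absolute IRIs such as "http://..." are returned as they are.
    if sep and prefix in context and not local.startswith("//"):
        return context[prefix] + local
    return value


def expand_keys(doc, context):
    """Return a copy of doc with top-level keys replaced by full IRIs."""
    return {expand_term(key, context): value for key, value in doc.items()}


expanded = expand_keys(SAMPLE, CONTEXT)
```

After expansion, the `isSampleOf` key becomes the SOSA property URI, which is what allows an agent to recognize the link from the sampling feature to the domain feature regardless of the short key a publisher chose.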

For SELFIE, nodes in a knowledge graph are non-information resources, and the nodes of most interest are those that identify domain features. Sampling and semantic resources provide metadata that describe those features and organize available knowledge about them. Therefore, the SELFIE knowledge graph describes relationships between non-information resources in the 'real world'. The links between the information resources that describe them are an important but separate concern. In SELFIE, how to transition from non-information resources in the knowledge graph to those in the web of information resources was an important consideration where further work is needed.

6.2. ELFIE Resources

At the outset of SELFIE, the team was thinking in terms of a "three-tiered resource model" where "resource" was a thing identified by a URI. The resource model involved "non-information resources", "meta-information resources", and "data-information resources". Conceptually, the "non-information" "meta" and "data" scheme is useful, but the word "resource" is wrong when applied to the terminology of particular technologies (such as HTTP Resources). As such, the team found a need to change its language to more accurately reflect what was found to be useful ways to describe what is actually a set of use-cases that can be described in terms of non-information, metadata, and data.

Everything in this scheme is identified by an HTTP URI. In general, we have three categories that can be described as follows:

  1. non-digital things that are not information,

  2. digital things that provide meta-information about non-information things, and

  3. digital things that are information representing or characterizing other things.

Tier 1. is clear — there should be URIs that only ever return a 300 series redirect and are identifiers for real-world features. Tier 2. is hard to define precisely. It can only be defined strictly by the application retrieving it rather than by specific characteristics of its content. Tier 3. is clear in most cases but has potential overlap with tier 2 in that some applications may consider metadata about a feature to actually be data representing the feature. Since self-describing data always contains metadata, we would expect most if not all of tier 2 to be contained in tier 3.

If we think about it this way, then tier 2 is a convenience layer to achieve a certain functional goal. In SELFIE, tier 2 is a convenience layer for search-engine crawlers and humans looking for an idea of what a real-world feature is, what it’s related to, and if there’s interesting data available representing it.

This should make it clear that resource is the wrong word to describe the distinction being drawn here. It is ok for tier 1, but it breaks down for tier 2 and 3. A single URL may have one or more representations designed to be metadata and one or more representations intended to be data — each intended for a different use pertaining to the same real-world feature. This is not saying that tier 2 and tier 3 are always to be represented variously based on the same URL — they very well may be represented as different resources. This technical diversity is what SELFIE sought to enable.

Consider a typical use case considered by the SELFIE:

As a Web user, I want to find all the information available for an environmental feature, so I can find what I’m looking for and retrieve it.

As the project dissected this use case, the httpRange-14 decision (303 redirects), OGC API - Features (HTML landing pages for features), and schema.org JSON-LD in a <script> tag of a landing page were all embraced as useful technical solutions that serve it. None of these requires alternative resource representations (media types). An HTML landing page satisfies both a human user and a crawler designed around natural language and schema.org.
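As an illustration of how a crawler might consume such a landing page, the sketch below pulls JSON-LD out of `<script type="application/ld+json">` elements using only the Python standard library. The landing page, the feature URI, and the class and function names are hypothetical, and error handling for malformed JSON is omitted.

```python
import json
from html.parser import HTMLParser

# Hypothetical landing page with schema.org JSON-LD embedded in the header.
LANDING_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>Example River</title>
    <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@id": "https://example.org/id/river/1",
      "@type": "Place",
      "name": "Example River"
    }
    </script>
  </head>
  <body><h1>Example River</h1></body>
</html>"""


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD documents found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


docs = extract_jsonld(LANDING_PAGE)
```

The same HTML response therefore serves a human reader (the visible body) and a structured-data crawler (the embedded JSON-LD) without any content negotiation.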

However, the IE recognized the potential to bring structure to data (including semantically enabled data) underlying the landing page content. Going further, the SELFIE was premised on the idea that html landing pages are layered on top of a potentially wide range of data systems that need such a discovery layer. Given this, while not addressed specifically in SELFIE, alternative media types and content negotiation are expected in a system that the SELFIE model is applied to. However, the complexity and lack of broad implementation made making progress on this front difficult.

Participants in the IE agreed that they would avoid specifying how to use content negotiation between the "meta" and "data" tiers. Standards for content negotiation by profile are emerging but we have not been able to evaluate them rigorously. Instead, SELFIE was limited to describing how to advertise that multiple content-types are available for a given URL in structured JSON-LD data. The scope and summary architecture are described in Figure 2.

SELFIE Architecture
Figure 2. Summary of the SELFIE resource / content model showing that there are non-information resources which 303 redirect to a resource intended to provide "landing content". The distinction between landing-content and data-content is use-case specific, and methods for negotiating between the two are left for future work.

6.3. "In band" and "out of band" resources

The idea of "in-band" and "out-of-band" has been brought up as a useful distinction between resource representations that can provide information that is useful to a given application (in-band) and resource representations that are opaque to an application (out-of-band). In reality, there are many bands that correspond to various applications. Here, we define the SELFIE-band which is intended to foster interoperability toward the goals of the IE.

There are three defining characteristics of the SELFIE "band":

  1. The resources: ELFIE is a graph of non-information resources.

  2. The access protocol: The HTTP protocol (with no extensions [perhaps controversial?]) with responses managed according to the range-14 decision.

  3. The encoding: HTML + JSON-LD and JSON-LD in which ELFIE non-information resources are identified, and linked to, using the JSON-LD @id key.

A SELFIE resource is recognizable because:

  1. it has an @id;

  2. it has a format property that includes application/ld+json.

This limited set of criteria covers the important architectural concerns. It implies an 'architectural profile' that encompasses @id, schema:url, dct:format, and rdfs:label and therefore basic resource description and linking.

To illustrate the distinction, consider the following JSON-LD example which has one schema:sameAs and one schema:subjectOf property for an identified feature:

{
  "@id": "https://feature.id",
  "http://schema.org/sameAs":
  {
      "@id": "https://someresource",
      "http://purl.org/dc/terms/format": "application/ld+json;",
      "http://www.w3.org/2000/01/rdf-schema#label": "A resource that can extend the linked data graph."
  },
  "http://schema.org/subjectOf":
  {
    "http://schema.org/url": "https://blobby",
    "http://purl.org/dc/terms/format": "application/xml;",
    "http://www.w3.org/2000/01/rdf-schema#label": "blobby thing with the feature as its subject"
  }
}

Alternatively, when we resolve `https://feature.id` we might get a more limited document that does not include pre-fetched content about `https://someresource`:

{
  "@id": "https://feature.id",
  "http://schema.org/sameAs":
  {
    "@id": "https://someresource"
  },
  "http://schema.org/subjectOf": {
    "http://schema.org/url": "https://blobby",
    "http://purl.org/dc/terms/format": "application/xml;",
    "http://www.w3.org/2000/01/rdf-schema#label": "blobby thing with the feature as its subject?"
  }
}

Which would mean we would need to resolve and interrogate `https://someresource` to retrieve information needed to decide whether it is of interest, which is possible with the "in-band" `https://someresource`, and might give us the JSON-LD below, but impossible with the "out-of-band" `https://blobby` which might only return xml or linked data using an unknown ontology.

{
  "@id": "https://someresource",
  "http://www.w3.org/2000/01/rdf-schema#label": "A resource that can extend the linked data graph.",
  "http://purl.org/dc/terms/format": "application/ld+json;",
  "http://www.w3.org/2000/01/rdf-schema#seeAlso": "https://someOtherThing"
}

Note that we have avoided discussing @type and conformsTo. Use of these properties, while valuable, introduces complexities that were determined to be beyond what SELFIE could address.
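The recognition criteria in this section (an @id plus a format that includes application/ld+json) can be sketched as a small Python check. This is one possible reading of the criteria rather than a SELFIE specification: the function name and the three labels it returns are assumptions, and format values are assumed to be plain strings. The "unresolved" label covers a linked node that carries only an @id, which a client would need to dereference before deciding.

```python
DCT_FORMAT = "http://purl.org/dc/terms/format"

# The first example document from this section.
FEATURE = {
    "@id": "https://feature.id",
    "http://schema.org/sameAs": {
        "@id": "https://someresource",
        "http://purl.org/dc/terms/format": "application/ld+json;",
        "http://www.w3.org/2000/01/rdf-schema#label":
            "A resource that can extend the linked data graph.",
    },
    "http://schema.org/subjectOf": {
        "http://schema.org/url": "https://blobby",
        "http://purl.org/dc/terms/format": "application/xml;",
        "http://www.w3.org/2000/01/rdf-schema#label":
            "blobby thing with the feature as its subject",
    },
}


def classify(node):
    """Label a linked node as 'in-band', 'out-of-band', or 'unresolved'."""
    if "@id" not in node:
        return "out-of-band"  # nothing to link to within the graph
    formats = node.get(DCT_FORMAT)
    if isinstance(formats, str):
        formats = [formats]
    if not formats:
        return "unresolved"  # must dereference the @id to find out
    # Drop parameters and the trailing ';' used in the examples above.
    media_types = {f.split(";")[0].strip().lower() for f in formats}
    if "application/ld+json" in media_types:
        return "in-band"
    return "out-of-band"


same_as = classify(FEATURE["http://schema.org/sameAs"])
subject_of = classify(FEATURE["http://schema.org/subjectOf"])
```

Applied to the first example, the schema:sameAs target is classified as in-band and the schema:subjectOf target as out-of-band. For the more limited second document, the bare `{"@id": "https://someresource"}` node is classified as unresolved.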

6.4. Resource Resolution Alternatives

The Range-14 decision, to identify real-world features with URIs that HTTP-303 redirect to resources providing information about the real-world feature, was accepted by SELFIE. Figure 3 illustrates the complete solution.

SELFIE fig3
Figure 3. Complete range-14 resolution behavior.

However, to simplify implementation, some landing resource providers skip the 303 redirect entirely, using a URL for a landing resource as an indirect identifier of a real-world feature. Figure 4 illustrates this less complicated, but limited, approach.

SELFIE fig4
Figure 4. Indirect identification of a feature where a URL is used as an indirect identifier for a real-world feature.

There are two related problems with the indirect identification approach: one technical and one social. Both issues stem from the need to maintain stable identifiers for real world features and very real needs to change URLs to retrieve digital resources.

The technical issue is related to how URLs are used to drive server behavior. Changes to server software implementation often necessitate changes to URL paths or parameters. The requirement to maintain URL stability is in conflict with this and causes needless complexity for server-implementers.

Socially, real-world feature identification is a process undertaken by a group of people that is likely not the same as those who implement the server software used to retrieve information about those features. Identification of features may work best with a different URI structure than retrieval of digital information about those features; forcing the two groups of people to reconcile these patterns is an unneeded, complicated, and likely fraught interaction that can be eliminated by separating real world feature identification from information index resource identification.

Adding content negotiation to the discussion of resource resolution, a 303 redirect works fine as long as the client passes the same Accept header to the redirect target URL. However, there is a common content negotiation override practice involving URL parameters such as ?f=mime-type or ?format=mime-type that may be desirable to have passed along as part of a 303 redirect. Some SELFIE participants support such mime-type overrides, but additional experimentation will be required to determine if there is a solution that should be recommended for this in general. Note that this says nothing about content negotiation "by profile", an emerging technique that was judged to be beyond the scope SELFIE could address.
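One way a resolver could pass such an override along is sketched below in Python. This is only an illustration of the idea, not a behavior SELFIE recommends: the identifier-to-landing-resource table, the host names, and the decision to forward only the `f` and `format` parameters are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical mapping from feature identifier paths to landing resources.
LANDING = {
    "/id/river/1": "https://data.example.org/collections/rivers/items/1",
}

# Query parameters treated as media-type overrides (an assumption).
OVERRIDE_PARAMS = ("f", "format")


def resolve(identifier_url):
    """Return (status, headers) for a GET on a feature identifier URI."""
    parts = urlsplit(identifier_url)
    target = LANDING.get(parts.path)
    if target is None:
        return 404, {}
    # Keep only the override parameters; anything else is dropped.
    forwarded = [(key, value) for key, value in parse_qsl(parts.query)
                 if key in OVERRIDE_PARAMS]
    if forwarded:
        t = urlsplit(target)
        query = urlencode(parse_qsl(t.query) + forwarded)
        target = urlunsplit((t.scheme, t.netloc, t.path, query, t.fragment))
    # 303 See Other: the identifier names a real-world feature, and the
    # Location header points at information about it.
    return 303, {"Location": target}


status, headers = resolve("https://example.org/id/river/1?f=json")
```

The Accept header needs no special handling here because the client resends it when following the redirect. Only the URL-based override would otherwise be lost.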

Extending the resource resolution use case to include retrieving representations of a feature introduces additional functions that were the subject of some SELFIE experiments. Two such resolution schemes were tested. One required a client to inspect information index hypermedia and make an additional request for an available representation. The other used media-type content negotiation to return a representation available via that media type directly from a Range-14 indirect identifier without the client needing to review information index hypermedia. These two schemes are illustrated in Figure 5. These alternatives are equally valid and further work is needed to determine if one is preferable to the other.