Open Geospatial Consortium

Submission Date: 2021-10-06

Approval Date:   2021-10-06

Publication Date:   2021-10-07

External identifier of this OGC® document: http://www.opengis.net/doc/DP/GDC

Internal reference number of this OGC® document:    21-067

Category: OGC® Discussion Paper

Editor:   Ingo Simonis (OGC)

OGC: Towards Data Cube Interoperability

Copyright notice

Copyright © 2021 Open Geospatial Consortium

To obtain additional rights of use, visit http://www.opengeospatial.org/legal/

Warning

This document is not an OGC Standard. This document is an OGC Discussion Paper and is therefore not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, an OGC Discussion Paper should not be referenced as required or mandatory technology in procurements.

Document type:    OGC® Discussion Paper

Document subtype:

Document stage:    Final

Document language:  English

License Agreement

Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.

If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.

THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD.

THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.

This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.

Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications. This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.

i. Abstract

Data cubes, multidimensional arrays of data, are now used frequently, but differences in design, interfaces, and handling of temporal characteristics cause interoperability challenges for anyone interacting with more than one solution. To address these challenges, the Open Geospatial Consortium (OGC) and the Group on Earth Observations (GEO) invited global data cube experts to discuss the state of the art and the way forward at the “Towards Data Cube Interoperability” workshop. The two-day workshop, conducted in late April 2021, started with a series of pre-recorded position statements by data cube providers and data cube users. These videos served as entry points for intense discussions that not only produced a new definition of the term ‘data cube’ (by condensing and shifting the emphasis of what is known as the six faces model), but also pointed out a wide variety of expectations with regard to data cube behaviour and characteristics as well as data cube usage patterns. This report summarizes the various perspectives and discusses the next steps towards efficient usage of data cubes. It starts with the new definition of the term Data Cube, as this new understanding drives several recommendations discussed later in this report. The report includes further discussion that followed the actual workshop, mainly conducted in the context of the Geo Data Cube task in OGC Testbed-17.

ii. Keywords

The following are keywords to be used by search engines and document catalogues.

Geo Data Cube, Interoperability, Software architecture

iii. Preface

Note

This document originated in work undertaken by OGC staff together with the Group on Earth Observations (GEO) and workshop participants. Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.

Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.

iv. Submitting organizations

The following organizations submitted this Document to the Open Geospatial Consortium (OGC):

OGC

v. Submitters

All questions regarding this submission should be directed to the editor or the submitters:

Ingo Simonis (OGC)

1. What is a Data Cube?

Existing definitions coming from the (geospatial) computer science domain often focus exclusively on data structure aspects. Here, a data cube is defined as a multi-dimensional ("n-D") array of values, with emphasis on the fact that “cube” is just a metaphor to help illustrate a data structure that can in fact be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. The dimensions may be coordinates or enumerations, e.g., categories.
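The array-oriented definition can be illustrated with a minimal sketch in Python. The dimension names, labels, and the helper functions `shape` and `select` are assumptions made for illustration only; they are not taken from any data cube implementation discussed at the workshop.

```python
from itertools import product

# Hypothetical dimension labels: two coordinate axes and one enumeration axis.
dims = {
    "lat": [46.0, 46.5, 47.0],          # coordinate dimension
    "lon": [7.0, 7.5],                  # coordinate dimension
    "land_cover": ["forest", "urban"],  # enumeration (categorical) dimension
}

# The cube holds one value per combination of dimension labels.
cube = {key: 0.0 for key in product(*dims.values())}
cube[(46.5, 7.0, "forest")] = 0.8


def shape(dims):
    """Return the number of labels along each dimension."""
    return tuple(len(labels) for labels in dims.values())


def select(cube, dims, **fixed):
    """Return the cells whose labels match the given dimension values."""
    names = list(dims)
    return {
        key: value
        for key, value in cube.items()
        if all(key[names.index(name)] == label for name, label in fixed.items())
    }


print(shape(dims))                                    # (3, 2, 2)
print(len(select(cube, dims, land_cover="forest")))   # 6
```

The same structure works for any number of dimensions, which is why "cube" is only a metaphor here.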

This workshop emphasized the need to leave these computer-science based definitions behind and focus on the user perspective instead. What is a data cube from the user’s perspective? We currently observe a general shift from data centric to user centric perspectives. Users don’t care if data is stored in a relational database, in a cloud-based object store, or a file server. They are interested in the access mechanisms to the data and the processing algorithms they can apply.

The workshop was conducted within the location or geo community. Thus, the terms geo data cube and data cube are used interchangeably. It is therefore assumed that each cube has some spatial characteristics. Though no formal consensus process was applied, the following definition describes the tenor of the vast majority of workshop participants:

“A (geo) data cube is a discretized model of the earth that offers estimated values of certain variables for each partition of the Earth’s surface called a cell. A data cube instance may provide data for the whole Earth or a subset thereof. Ideally, a data cube is dense (i.e., does not include empty cells) with regular cell distance for its spatial and temporal dimensions. A data cube describes its basic structure, i.e., its spatial and temporal characteristics and its supported variables (also known as ‘properties’), as metadata. It is further defined by a set of functions. These functions describe the available discovery, access, view, analytical, and processing methods that are supported to interact with the data cube.”

As becomes apparent, the cube is described for the user, not the data. It does not matter if the cube contains one, two, or three spatial dimensions. Time can be modeled as a set of additional dimensions, though in most cases, there is likely only a single temporal dimension that describes the time of observation. Other temporal dimensions include, for example, ‘validity time’ (often used in the simulation community to describe when and how long a projected value is valid).

Spatially, the ideal data cube is dense with no gaps between cells. Each cell represents an area in the real world and the set of all cells represents a continuous area without holes. Such a cube allows the retrieval of property values for any location within the bounds of the data cube. Broader definitions of a data cube support cube structures that are not spatially dense. In most cases, these are point-oriented data cubes. Here, individual data points are distributed in space and regularly ordered in the data cube. In this case, the data cubes do not contain values for locations in between data points, but may offer interpolation methods to calculate property values at any location. Examples include data cubes with a set of measuring stations that line up all stations in a single dimension. The stations provide data for discrete point locations and any property value for locations in between stations needs to be interpolated. Whereas the data cube purists insisted on the spatially dense criterion as an essential characteristic for a data cube, the majority of the workshop participants accepted the broader definition as long as the user is sufficiently informed about the applied interpolation methods.
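A point-oriented cube of the kind described above can be sketched as follows. The station records, the function name `value_at`, and the choice of inverse distance weighting are assumptions for illustration; the workshop did not prescribe an interpolation method. The sketch returns the method name together with the value, so that the user is informed about how a value was obtained.

```python
import math

# Hypothetical stations; the cube's single "station" dimension indexes this list.
stations = [
    {"id": "A", "x": 0.0, "y": 0.0, "temperature": 10.0},
    {"id": "B", "x": 10.0, "y": 0.0, "temperature": 20.0},
]


def value_at(x, y, variable, power=2.0):
    """Return (value, method) for an arbitrary location.

    Locations that coincide with a station return the observed value.
    All other locations are estimated by inverse distance weighting,
    which is an assumed method chosen for this sketch only.
    """
    weighted = []
    for station in stations:
        distance = math.hypot(station["x"] - x, station["y"] - y)
        if distance == 0.0:
            return station[variable], "observed"
        weighted.append((1.0 / distance ** power, station[variable]))
    total_weight = sum(weight for weight, _ in weighted)
    estimate = sum(weight * value for weight, value in weighted) / total_weight
    return estimate, "inverse-distance-weighting"


print(value_at(0.0, 0.0, "temperature"))  # observed value at station A
print(value_at(5.0, 0.0, "temperature"))  # estimate halfway between A and B
```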

Still, the issue could not be fully solved at the workshop, which is why the definition provided herein speaks of the “ideal data cube” being spatially dense. The following figure shows different implementations of data cubes. Each cell may contain 1 to many variables. Time can be among these variables.

Figure 1. Different implementations of data cubes

In the figure above, cube (1) organizes cells along two spatial dimensions and one temporal dimension. Cube (2) adds altitude as a third spatial dimension. Here, time could be handled as a fourth dimension or become part of the variables expressed in each cell. The property versus dimension pattern is further illustrated in cube (3), which organizes time similarly to other variables (properties) in a specific dimension. Technically, any dimension can be transformed into a property of a cell and vice versa. The choice depends on the specific set of questions that users pose against the data cube. Thus, property versus dimension is not a technical challenge, but rather a decision to be made to provide the best user experience to the data cube customer, which could lead to an even stronger user-oriented definition of data cube: ‘The ideal data cube follows the mental model of its user group’.
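The statement that any dimension can be transformed into a property and vice versa can be sketched with plain Python dictionaries. The cell keys, the `ndvi` variable, and both function names are hypothetical and serve only to show that the transformation is lossless.

```python
# Time as a dimension: it is part of each cell's key (x, y, time).
by_dimension = {
    (0, 0, "2021-04-01"): {"ndvi": 0.61},
    (0, 0, "2021-04-02"): {"ndvi": 0.64},
    (0, 1, "2021-04-01"): {"ndvi": 0.30},
}


def time_to_property(cube):
    """Move time from the cell key into the attribute records of each cell."""
    result = {}
    for (x, y, time), props in cube.items():
        result.setdefault((x, y), []).append({"time": time, **props})
    return result


def time_to_dimension(cube):
    """Move time from the attribute records back into the cell key."""
    result = {}
    for (x, y), records in cube.items():
        for record in records:
            props = {k: v for k, v in record.items() if k != "time"}
            result[(x, y, record["time"])] = props
    return result


by_property = time_to_property(by_dimension)
print(len(by_property))                                # 2 spatial cells
print(time_to_dimension(by_property) == by_dimension)  # True
```

Both layouts carry the same information; which one is exposed is a question of which queries should be most convenient for the user.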

The following cube implementations (4) to (6) illustrate further possible implementations; all following the ‘broader’ definition as discussed above. Cube (4) uses two spatial dimensions and represents different products in the third dimension. Here, cells along the product axis may have different variables. Cube (5) and cube (6) represent a set of stations.

A cube does not need to support temporal dimensions. Temporal characteristics might be expressed as property values within the attribute vector per cell. Alternatively, a cube can provide any number of temporal dimensions, i.e., treat temporal characteristics as first-class citizens, which allows efficient exploration of the time dimension(s) via data cube functions.

The further a data cube implementation departs from the ideal data cube with its spatially dense characteristics, the closer the data cube aligns with a general database. There is another element that influences that blurry line between a data cube and a general database. User experience is to a good extent determined by the knowledge the user has about the cube, and the functions offered by the data cube to access and process the data. Combining these two brings in another aspect that has not been discussed in detail yet: The role of metadata. Integration and processing of data in multiple steps leads to new products after each step. With the right metadata, users can understand what constitutes each of these products in detail. Thus, a data cube represents a specific, documented product within a data integration and/or processing workflow. The data cube is constituted by a database with access and processing functions that is documented with metadata to sufficiently understand the offered data for further processing, analysis, or decision making. Depending on their position within the value-adding workflow, data cubes may offer raw data as delivered by sensors or models, analysis ready data (ARD), or decision ready information (DRI).

It is the metadata elements and functions that primarily differentiate the data cube definition taken here from other definitions that define a cube from the computer science perspective. For the user, it matters what functions are offered by a data cube instance. A user needs to understand what questions can be asked to access data that fulfills specific filter criteria, how to visualize (sub-)sets of data, or how to execute analytical functions and other processes on the data cube. If supported, the user needs to understand how to add additional processes to the data cube so that they can be executed directly on the data cube and do not require prior download of data.
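The role of functions can be sketched with a toy in-memory cube that lists its processes, accepts user-supplied processes, and executes them on filtered data without handing the data out first. The class `DataCube` and its methods are a hypothetical interface invented for this sketch; they do not correspond to any existing API.

```python
from statistics import mean


class DataCube:
    """Toy in-memory cube with a discoverable set of functions (hypothetical)."""

    def __init__(self, cells):
        self._cells = cells                 # {(x, y): value}
        self._processes = {"mean": mean}    # processes offered out of the box

    def list_processes(self):
        """Discovery: report which processes can be executed on the cube."""
        return sorted(self._processes)

    def register_process(self, name, func):
        """Add a user-supplied process that runs next to the data."""
        self._processes[name] = func

    def execute(self, name, predicate=lambda key: True):
        """Run a process on all cells whose key passes the filter."""
        values = [value for key, value in self._cells.items() if predicate(key)]
        return self._processes[name](values)


cube = DataCube({(0, 0): 1.0, (0, 1): 3.0, (1, 0): 5.0})
cube.register_process("range", lambda values: max(values) - min(values))

print(cube.list_processes())                                      # ['mean', 'range']
print(cube.execute("mean"))                                       # 3.0
print(cube.execute("range", predicate=lambda key: key[0] == 0))   # 2.0
```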

All other characteristics, such as spatial and temporal details (e.g., being dense or sparse, overlapping or perfectly aligned, area- or point-based), and property details (scales of measurements, handling of incomplete data, interpolation methods, error values, etc.) are provided as cube metadata. Metadata can provide different levels of detail. In this context, it needs to be emphasized that many observations include simplifications or other decisions that influence the properties of the observations without being described as metadata or otherwise easily noticed. As examples, the individual pixels of camera sensors are often read out sequentially, or a push-broom sensor on a satellite cannot keep time still during a full swing. Regardless of the small temporal differences, a single observation time value is usually assigned in both cases.
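A cube description along these lines might be sketched as a small metadata record. All field names and values below are assumptions for illustration and do not follow any existing metadata standard. The `warnings` method shows one way a client could detect that a sparse cube does not document its interpolation method.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class VariableMetadata:
    name: str
    unit: str
    scale_of_measurement: str            # e.g. "ratio", "ordinal", "nominal"
    nodata_value: Optional[float] = None


@dataclass
class CubeMetadata:
    spatially_dense: bool                # dense or sparse
    cell_geometry: str                   # "area" or "point"
    temporal_dimensions: List[str]
    interpolation_method: Optional[str] = None
    variables: List[VariableMetadata] = field(default_factory=list)

    def warnings(self):
        """List characteristics a user should be told about explicitly."""
        notes = []
        if not self.spatially_dense and self.interpolation_method is None:
            notes.append("sparse cube without a documented interpolation method")
        return notes


station_cube = CubeMetadata(
    spatially_dense=False,
    cell_geometry="point",
    temporal_dimensions=["observation_time"],
    variables=[VariableMetadata("temperature", "degC", "interval", nodata_value=-9999.0)],
)

print(station_cube.warnings())
print(json.dumps(asdict(station_cube), indent=2))
```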

The data cube definition provided herein does not define any characteristics of the physical storage model of the data on disk or in memory. Because the definition is fully independent of the selected data storage model, it covers both data cubes created ad hoc on demand and data cubes held in long-term, read-only physical storage. For the consumer of the data cube, it is usually irrelevant whether the data is stored in a relational database, a cloud-based object store, or a file server. These aspects become more important in complex scenarios where, for example, data from several cubes needs to be fused or different security models are enforced. The situation is different for ad-hoc created data cubes. Given that these are usually produced by processes that use some other data and possibly specific parameterization, reproducibility of results might be affected due to changing content of the cube.

2. Do we need an Abstract Model for Data Cubes?

One discussion thread revolved around mathematical foundations and formal abstractions of data cubes. Though there is common agreement that a mathematical foundation would eventually lead to enhanced interoperability, there was no majority among the workshop participants for such an approach. This further underpins the basic stance that a user-centric, flexible design for data cubes is favored over fundamentally more solid but inflexible and restrictive approaches. Instead, interoperability is to be ensured by a single, common method to describe various cube designs, which leaves more degrees of freedom and better acknowledges that flexibility is required to adapt to the variety of user requirements and usage scenarios.

3. User’s Perspective

In his presentation, Amruth Kiran from the Indian Institute for Human Settlements described experiences and lessons learned with the Indian data cube, which is used to analyse Earth observation data together with census and sample survey data. Using the Open Data Cube framework, the team built a multi-sensor data cube with support for time series analysis. The data cube serves as a baseline for statistical analysis and machine learning model development to understand land cover changes, population development, and other indicators over time at varying spatial and temporal resolutions for the whole country. The challenge, he noted, is not so much with the setup of the data cube, but with the integration of several cubes that provide different data sets. As soon as these data cubes are based on different underlying software, APIs and additional technologies such as STAC (SpatioTemporal Asset Catalog) need to be investigated to facilitate the integration.

Gregory Giuliani, University of Geneva, presented the Swiss Data Cube Platform as a Service. Serving mostly Landsat and Sentinel data, the Swiss data cube uses the Open Data Cube framework to index the data and serves it to a series of clients such as Jupyter Notebooks, Web Services, and Web APIs. Gregory emphasized the importance of libraries being served together with the actual data cube (i.e., to support the functions described in the data cube definition section above). These libraries, which in particular contain algorithms that work on the cube data, are an essential part of the user experience for a variety of users. For the future, Swiss Data Cube 2.0 plans to use COG (Cloud Optimized GeoTIFF) for enhanced user experience with support for multiple CRS (Coordinate Reference Systems) and spatial resolutions. There is high confidence that the combination of STAC and OGC APIs (and corresponding ISO standards) will further address outstanding interoperability issues. The key remaining challenge is the “application-to-the-data” paradigm and the “distributed application” paradigm, where new applications can be added to data cube platforms and different data cubes can be fused to produce analytical results collaboratively. Given the effort and complexity involved in producing data cubes, the extra value of federated cube analytics is certainly worth the investment in a common API and data cube description approach. Good experiences have been made with a metadata site that describes the various products and services offered by the data cube platform.

Jimena Juárez, National Institute of Statistics and Geography in Mexico, shared experiences with the Mexican Geospatial Data Cube. The cube is currently explored in three thematic areas: Urban growth, vegetation and deforestation, and water availability.