Publication Date: 2021-01-13
Approval Date: 2020-12-14
Submission Date: 2020-11-19
Reference number of this document: OGC 20-041
Reference URL for this document: http://www.opengis.net/doc/PER/t16-D018
Category: OGC Public Engineering Report
Editor: Joan Maso
Title: OGC Testbed-16: Analysis Ready Data Engineering Report
COPYRIGHT
Copyright © 2021 Open Geospatial Consortium. To obtain additional rights of use, visit http://www.opengeospatial.org/
WARNING
This document is not an OGC Standard. This document is an OGC Public Engineering Report created as a deliverable in an OGC Interoperability Initiative and is not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, any OGC Public Engineering Report should not be referenced as required or mandatory technology in procurements. However, the discussions in this document could very well lead to the definition of an OGC Standard.
LICENSE AGREEMENT
Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.
If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.
THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD. THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.
This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.
Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications.
This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.
None of the Intellectual Property or underlying information or technology may be downloaded or otherwise exported or reexported in violation of U.S. export laws and regulations. In addition, you are responsible for complying with any local laws in your jurisdiction which may impact your right to import, export or use the Intellectual Property, and you represent that you have complied with any regulations or registration procedures required by applicable law to make this license enforceable.
- 1. Subject
- 2. Executive Summary
- 3. References
- 4. Terms and definitions
- 5. Overview
- 6. ARD definition
- 7. Where to find ARD
- 8. Architectures to provide ARD
- 9. Tools for using ARD
- 10. Federated architecture for ARD
- 11. Applying ARD to Machine Learning
- 12. Recommendations
- Appendix A: Revision History
- Appendix B: Bibliography
1. Subject
The Committee on Earth Observation Satellites (CEOS) defines Analysis Ready Data (ARD) for Land (CARD4L) as "satellite data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets".
This OGC Testbed 16 Engineering Report (ER) generalizes the ARD concept and studies its implications for the OGC Standards baseline. In particular, the ER analyses how modern federated data processing architectures that apply data cubes and Docker packages can take advantage of the existence of ARD. Architectures for ARD should minimize data transmission and instead favor transmitting code and executing it remotely. As part of the event-driven discussion, this ER also considers a workflow in which new processes are triggered as soon as new data becomes available.
2. Executive Summary
Users spend most of their time preparing data rather than analyzing it. Obtaining data that is ready for analysis saves considerable time and effort and permits fast results and interpretation. This makes the concept of Analysis Ready Data very attractive for decision makers. This Engineering Report surveys the different interpretations of the ARD concept, which can be classified into content readiness and technical readiness; both are necessary to enable fast analysis.
CEOS is taking the most rigorous approach to content readiness by focusing on establishing the requirements for making some selected satellite products ready for analysis and formalizing Product Family Specifications (PFS). The main characteristics of the CEOS ARD for Land (CARD4L) can be easily extended to other data sources, including in-situ measurements. Distilling from the CARD4L concept, a dataset is ready for analysis when it represents one or more physical variables, is georeferenced in a common CRS, is homogeneous and comparable in time, is flagged with tags for quality and for wrong or missing values, and has a fully documented creation process.
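The five conditions distilled from CARD4L can be read as a checklist. The following Python sketch is purely illustrative: the `DatasetDescription` class, its field names, and the `missing_readiness_criteria` function are hypothetical and do not correspond to any CEOS or OGC schema. It only shows how such a checklist might be evaluated against a dataset description.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Hypothetical, simplified dataset description (not a CEOS or OGC schema)."""
    physical_variables: list = field(default_factory=list)  # e.g. ["surface_reflectance"]
    crs: str = ""                        # identifier of the common CRS, e.g. "EPSG:32631"
    temporally_homogeneous: bool = False  # comparable through time
    has_quality_flags: bool = False       # quality, wrong or missing value tags
    provenance_documented: bool = False   # creation process fully documented


def missing_readiness_criteria(ds: DatasetDescription) -> list:
    """Return the names of the readiness conditions the description does not meet."""
    checks = {
        "represents one or more physical variables": bool(ds.physical_variables),
        # Assumption: a declared CRS identifier stands in for "georeferenced in a common CRS".
        "georeferenced in a common CRS": bool(ds.crs),
        "homogeneous and comparable in time": ds.temporally_homogeneous,
        "flagged with quality, wrong or missing value tags": ds.has_quality_flags,
        "creation process fully documented": ds.provenance_documented,
    }
    return [name for name, ok in checks.items() if not ok]


# Example: a dataset that lacks provenance documentation.
example = DatasetDescription(["surface_reflectance"], "EPSG:32631", True, True, False)
gaps = missing_readiness_criteria(example)
```

An empty list would indicate that, under these simplified assumptions, the dataset satisfies all five conditions.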
This Engineering Report also considers technical readiness aspects, such as the availability of cloud-free data that is continuous in space and evenly distributed in time, as well as infrastructures to provide on-demand products or to process ARD in the cloud. Other communities also refer to these technical readiness aspects as ARD, but this document refers to them as technologies to improve the usability of ARD.
This Engineering Report puts ARD in the context of OGC standard services and provides several examples of use cases where the usability and technical readiness of ARD are enabled by adding OGC web services. The use cases presented are: detection of significant events from several sources; integration of diverse sources to enrich a time series for remote sensing phenology variable extraction; protection of Very High Resolution (VHR) data access while allowing for data processing; forest fire detection and monitoring in remote areas; and machine learning training with ARD and ARD trained model discovery.
Some previous initiatives to provide exploitation platforms for ARD rely on OGC services and transversal technologies. Despite the use of common technologies, most of them work in isolation. This Engineering Report explores how these and other transversal technologies can be used to define a federation of exploitation platforms that integrates several sources of ARD in a distributed computing environment. The federation uses the data cube metaphor (that can be described with the Coverage Implementation Schema) to deal with the heterogeneity of the data. The federation should consider ways to parallelize and distribute processing among different services minimizing the amount of data that is transmitted. A future interoperability experiment could go deeper into testing the approach by implementing some of the proposed use cases.
A chapter in this document explores how to use ARD in the context of Machine Learning. The concept of training ready data is introduced: training sets defined as annotations on top of ARD that follows a particular PFS. A catalogue of models trained to perform a particular task on ARD that follows a particular PFS is also introduced.
Finally, a future collaboration between CEOS and OGC is proposed in which the OGC can help increase the usability of ARD by considering ARD in data discovery, access and processing web services, and can help broaden the concept to other types of data. The collaboration with the OGC can also be beneficial in disseminating the ARD concept and its benefits to the OGC membership.
The Engineering Report identifies a concrete need for CEOS to include a formal indication of the physical variables that current and future CEOS PFS represent, preferably as a permanent URI in a definition service (e.g. the OGC Definitions Server). The Engineering Report also identifies the need for OGC services to support URIs that characterize the physical variables that data represents in data access services, as well as in the inputs and outputs of data processing services, in the same way that the SensorThings API already does with the ObservedProperty definition URI. We anticipate that the use of URIs for physical variables will contribute to automatic matching between data and processes, which will, in turn, increase the availability of derived ARD products.
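The automatic matching anticipated above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Dataset` and `Process` classes and the `match` function are hypothetical, and the `http://example.org/...` URIs are placeholders, not entries in the OGC Definitions Server. The only idea shown is that a dataset can feed a process when both declare the same physical-variable URI.

```python
from dataclasses import dataclass

# Placeholder URIs (assumptions for illustration only).
SURFACE_REFLECTANCE = "http://example.org/def/variable/surface-reflectance"
SURFACE_TEMPERATURE = "http://example.org/def/variable/surface-temperature"
NDVI = "http://example.org/def/variable/ndvi"


@dataclass(frozen=True)
class Dataset:
    name: str
    variable_uri: str          # physical variable the data represents


@dataclass(frozen=True)
class Process:
    name: str
    input_variable_uri: str    # physical variable the process expects
    output_variable_uri: str   # physical variable the process produces


def match(datasets, processes):
    """Pair each dataset with every process whose input URI equals the dataset URI."""
    return [
        (d.name, p.name)
        for d in datasets
        for p in processes
        if d.variable_uri == p.input_variable_uri
    ]


datasets = [
    Dataset("optical-ard", SURFACE_REFLECTANCE),
    Dataset("thermal-ard", SURFACE_TEMPERATURE),
]
processes = [Process("compute-ndvi", SURFACE_REFLECTANCE, NDVI)]
pairs = match(datasets, processes)
```

Because each process also declares an output URI, the result of one process could in principle be matched against the input of another, which is how chains of derived products might be discovered.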
2.1. Document contributor contact points
All questions regarding this document should be directed to the editor or the contributors:
Contacts
Name | Organization | Role |
---|---|---|
Joan Maso | UAB-CREAF | Editor |
Alaitz Zabala | UAB-CREAF | Contributor |
Alba Brobia | UAB-CREAF | Contributor |
2.2. Foreword
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.
Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.
3. References
No normative references are required in this document. Some fundamental references are:
- CEOS Analysis Ready Data Strategy version 1, October 2019. http://ceos.org/ard/files/CEOS_ARD_Strategy_v1.0_1-Oct-2019.pdf
- Dwyer, J. L., Roy, D. P., Sauer, B., Jenkerson, C. B., Zhang, H. K., & Lymburner, L. (2018). Analysis ready data: enabling analysis of the Landsat archive. Remote Sensing, 10(9), 1363. https://www.mdpi.com/2072-4292/10/9/1363
- Gonçalves P. (2019) OGC Testbed-15: Federated Clouds Analytics Engineering Report. http://docs.opengeospatial.org/per/19-026.html
- Percivall G. (2020) Geospatial Coverages Data Cube Community Practice. https://portal.ogc.org/files/18-095r7
More non-normative references can be found in the Bibliography at the end of this document.
4. Terms and definitions
The following terms and definitions apply.
- Analysis Ready Data
  sensed data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort for further interoperability both through time and with other datasets. (source: CEOS http://ceos.org/document_management/Meetings/Plenary/30/Documents/5.5_CEOS-CARD4L-Description_v.22.docx)
  Note: The CEOS original definition uses the word "satellite" instead of "sensed".
- CEOS Analysis Ready Data for Land
  products processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort. These products would be resampled onto a common geometric grid (for a given product) and would provide baseline data for further interoperability both through time and with other datasets. The CARD4L products are intended to be flexible and accessible products suitable for a wide range of users for a wide variety of applications, including particularly time series analysis and multi-sensor application development. They are also intended to support rapid ingestion and exploitation via high-performance computing, cloud computing and other future data architectures. They may not be suitable for all purposes and are not intended as a "replacement" for other types of satellite products (source: CEOS PFS template. Example: http://ceos.org/ard/files/PFS/NRB/v5.0/CARD4L-PFS_Normalised_Radar_Backscatter-v5.0.pdf)
- Interpretation Ready Data
  geospatial data that has been submitted to some well-documented common process to make it ready for direct human interpretation (probably with the help of some visualization tool) and eventual decision making. There is an expectation that the result of an appropriate analysis that uses Analysis Ready Data as input will be Interpretation Ready Data (definition by the authors based on several sources including https://www.geoaquawatch.org/wp-content/uploads/2020/05/ARD-GEO-AquaWatch-Discussion-paper.pdf)
- Processing levels
  a hierarchical list of numerical levels from 0 to 4 (sometimes with a letter A, B, etc. after the number) that indicates the processing applied to remote sensing satellite images before they are made accessible. Level 0 refers to raw data at full instrument resolution (rarely made available). Level 1 refers to data reconstructed at full resolution, time-referenced, and annotated with ancillary information, including some form of radiometric and geometric calibration coefficients and georeferencing parameters that may or may not have been applied to the data itself. Level 4 refers to model outputs or results from analyses that use satellite data as inputs (and that can be assimilated to physical measurements on the ground, such as temperature). (definition by the authors based on several sources including https://earthdata.nasa.gov/collaborate/open-data-services-and-software/data-information-policy/data-levels)
  Note: Historically, there has been some confusion about processing levels because they are not used consistently among agencies or even among different products of the same agency.
- Product Family Specification
  a specification of primary geophysical measurements that can be derived from CEOS satellite instruments as ARD. A PFS provides a list of requirements for a type of product (e.g. Surface Reflectance, Surface Temperature, Normalized Radar Backscatter, Polarimetric Radar, etc.) to be considered ARD. These lists of requirements are applicable to general metadata, per pixel metadata, geometric and radiometric corrections. For each requirement two levels of verification are provided: threshold and target (definition by the authors based on several sources including http://ceos.org/ard/files/PFS/SR/v5.0/CARD4L_Product_Family_Specification_Surface_Reflectance-v5.0.pdf)
  Note: There is an assumption of a consensus process among agencies to define these requirements that should avoid the confusion created by the processing levels concept.
  Note: CEOS developed the concept of "Product Families" as the second element of the CARD framework. CARD PFS is not prescriptive with regard to which data processing approach should be used. This recognizes that there are multiple processing approaches for producing ARD for a particular Product Family and that these will evolve through time. However, the data provider must document and disclose their methods as required by the PFS.
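The two verification levels of a PFS (threshold and target) can be illustrated with a short Python sketch. The `Level` enumeration, the requirement group names, and the aggregation rule (the overall level is the lowest level reached by any requirement) are assumptions made for illustration; they are not taken from a CEOS self-assessment procedure.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical encoding of the outcome of verifying one PFS requirement."""
    NOT_MET = 0
    THRESHOLD = 1
    TARGET = 2


def overall_level(assessment: dict) -> Level:
    """Return the lowest level reached across all assessed requirements.

    Assumption: a product is only as compliant as its weakest requirement.
    """
    if not assessment:
        raise ValueError("assessment must contain at least one requirement")
    return min(assessment.values())


# Example assessment, grouped as in the definition above (values are invented).
assessment = {
    "general metadata": Level.TARGET,
    "per pixel metadata": Level.THRESHOLD,
    "radiometric corrections": Level.TARGET,
    "geometric corrections": Level.THRESHOLD,
}
result = overall_level(assessment)
```

Under this rule, a single requirement assessed as `NOT_MET` would prevent the product from reaching even the threshold level.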
5. Overview
Quite often, especially in data-intensive analysis, data preparation takes more time than the analysis itself. Data preparation involves a set of procedures that (1) clean the data to remove artifacts or repetition and (2) adapt the data to the requirements of the analytical tools. This preparation is a tedious and costly process that has to be completed by anyone who wants to use the data. This is particularly true for remote sensing data, because raw satellite data needs to be corrected in several ways to obtain a product that is suitable for use and for combination with in-situ observations. The Committee on Earth Observation Satellites (CEOS) has promoted the concept of Analysis Ready Data (ARD) to simplify the use of these data. In ARD, the producer performs many of the common data preparation (pre-processing) steps and carefully documents the provenance in the metadata. Several agencies distribute the prepared product alongside their other offerings, which makes it easier to merge products that share the same Product Family Specification (PFS). In CEOS, each PFS details specific 'Threshold' and 'Target' requirements for the processed content.
This Testbed 16 Engineering Report (ER) is an attempt to consider ARD in relation to the current and emerging OGC Standards baseline.
Section ARD definition introduces the concept of Analysis Ready Data, provides definitions, and discusses the possibility of extending the concept beyond remote sensing data.
Section Where to find ARD lists some current initiatives offering ARD products or ARD based services.
Section Architectures to provide ARD discusses the high-level architectures that can be used by the producers for creating and providing ARD.
Section Tools for using ARD discusses tools that help consumers to take advantage of ARD.
Section Federated architecture for ARD describes a cloud-based federated architecture for working with multiple sources of ARD and explains how some common use cases could be carried out in such an architecture.
Section Applying ARD to Machine Learning discusses how ARD can be used in machine learning algorithms.
Section Recommendations includes some recommendations for future work, as well as recommendations for continuing the collaboration between CEOS and OGC.
6. ARD definition
Raw satellite instrument data is the imagery and metadata as collected by the sensor, prior to any processing. Since some fundamental corrections should be applied to the imagery before it is usable, most agencies will not distribute raw imagery. Raw satellite data is simply not ready to use.
There are geometric and radiometric corrections that can be applied in sequence to raw satellite data to make it more useful. There are many good tutorials explaining why these corrections are needed and how they can be applied. One example is the NRCan Geometric Distortion in Imagery web page. These corrections are required because of errors introduced by a variety of factors. Geometric corrections in particular are required because of at least four factors:
- The orbit of the satellite platform is not parallel to a meridian, resulting in an image that is rotated in relation to the North-South axis.
- The surface of the Earth is not flat and there are changes in the "elevation" over sea level. Most of the time the sensor has a non-vertical perspective (the point on the Earth that the sensor observes vertically is commonly called the nadir). In these situations, the perspective combined with elevation changes distorts the image.
- The Earth is not flat: its curvature introduces additional distortion.
- The optical nature of the sensor may introduce additional distortions. Fortunately, most of these distortions obey geometric laws and the effects can be reversed by using the appropriate geometric correction algorithm(s).
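Two of the factors above can be illustrated with elementary geometry. The Python sketch below is a simplified illustration, not an operational correction algorithm: it assumes a locally flat reference surface, ignores Earth curvature and sensor optics, and the function names are hypothetical. `relief_displacement` gives the approximate horizontal shift of a point at a given elevation when viewed off-nadir (elevation multiplied by the tangent of the view angle), and `rotate_to_north` applies a plain two-dimensional rotation such as the one needed to align an image with the North-South axis.

```python
import math


def relief_displacement(elevation_m: float, view_angle_deg: float) -> float:
    """Approximate horizontal displacement (m) of a point at the given elevation.

    Flat-reference approximation: displacement = elevation * tan(off-nadir angle).
    At nadir (0 degrees) the displacement is zero.
    """
    return elevation_m * math.tan(math.radians(view_angle_deg))


def rotate_to_north(x: float, y: float, rotation_deg: float) -> tuple:
    """Rotate image coordinates (x, y) counter-clockwise by rotation_deg."""
    a = math.radians(rotation_deg)
    return (
        x * math.cos(a) - y * math.sin(a),
        x * math.sin(a) + y * math.cos(a),
    )


# Example values (invented for illustration).
shift = relief_displacement(1000.0, 10.0)   # 1000 m peak viewed 10 degrees off nadir
corner = rotate_to_north(100.0, 0.0, 8.0)   # image rotated 8 degrees from North-South
```

Even this simplified model shows why elevation data is needed for geometric correction: the displacement grows with both terrain height and view angle, so it cannot be removed by a single global transformation of the image.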