Published

OGC Engineering Report

OGC Testbed 19 Analysis Ready Data Engineering Report
Editors: Liping Di, David J. Meyer, Eugene Yu

Document number: 23-043
Document type: OGC Engineering Report
Document subtype: Implementation
Document stage: Published
Document language: English

License Agreement

Use of this document is subject to the license agreement at https://www.ogc.org/license



I.  Executive Summary

Implementations of the Analysis Ready Data (ARD) concept are consistent with the FAIR principles of finding, accessing, interoperating, and reusing physical, social, and applied science data with ease. The goal of this Testbed 19 OGC Engineering Report (ER) is to advance the provision of geospatial information by creating, developing, identifying, and implementing ARD definitions and capabilities. Specifically, this ER aims to increase the ease of use of ARD through improved backend standardization and varied application scenarios. Additionally, this work seeks to inform ARD implementers and users about standards and workflows to enhance the capabilities and operations of ARD. Ultimately, the goal of the work described in this ER is to maximize ARD capabilities and operations and contribute to the enhancement of geospatial information provision.

Four distinct scenarios – gentrification, synthetic data, coverage analysis, and coastal studies – are explored to reveal both the strengths and limitations of the current ARD framework. The gentrification scenario, which utilizes existing Committee on Earth Observation Satellites (CEOS) ARD data, highlights the need to expand ARD’s scope beyond Earth Observation (EO) data. The integration of diverse data types, such as building footprints and socio-economic statistics, is crucial for comprehensive analysis. The synthetic data scenario explores the potential of simulated EO imagery to enhance data availability and diversity for machine learning applications. However, challenges in standardization and quality assessment require further investigation. The analysis of coverages for ARD reveals the importance of clear pixel interpretation (“pixel-is-point” vs. “pixel-is-area”) and standardized units of measure for seamless integration and analysis. Additionally, enriching the metadata structure with defined extensions is crucial for efficient data discovery and understanding. The coastal study scenario, where in-situ data needs to be elevated to ARD, emphasizes the need for flexible levels of readiness. Different analytical tasks may require distinct data properties, necessitating adaptable standards that cater to temporal emphasis, spatial alignment, and non-GIS applications like machine learning.

This work identified several key areas for improvement:

In addition to the above recommendations, the interoperability and support of ARD in wider communities warrants further exploration and implementation. Additionally, areas such as uniform evaluation and compliance certification could be further investigated to ensure consistency in data readiness across various hierarchies and application domains.

II.  Keywords

The following are keywords to be used by search engines and document catalogues.

testbed, web service, analysis ready data, remote sensing, earth observation

III.  Contributors

All questions regarding this document should be directed to the editors or the contributors:

Table — List of Contributors

Name | Organization | Role
Liping Di | George Mason University | Editor
David Meyer | NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) | Editor
Eugene Yu | George Mason University | Editor
Josh Lieberman | OGC | Task Architect
Peter Baumann | Rasdaman | Contributor
Dimitar Mishev | Rasdaman | Contributor
Chris Andrews | Rendered.AI | Contributor
Matt Robinson | Rendered.AI | Contributor
Daniel Hedges | Rendered.AI | Contributor
Li Lin | George Mason University | Contributor
Glenn Laughlin | Pelagis Data Solutions | Contributor
Carl Reed | Carl Reed and Associates | Content Reviewer
Jim Antonisse | WiSC Enterprises | Contributor

1.  Introduction

In this Engineering Report (ER), Analysis Ready Data (ARD) refers to time-series stacks of overhead imagery that are prepared for a user to analyze without having to pre-process the imagery themselves [71][88][103][110]. The idea behind ARD is that providers of satellite imagery are in a better position to undertake these routine steps than the average user [71]. Analysis-ready datasets have been responsibly collected and reviewed so that analysis of the data yields clear, consistent, and error-free results to the greatest extent possible [71].

ARD is important because it saves time and resources by providing users with data that have already been preprocessed and rigorously validated and are ready for analysis. ARD also ensures that users have access to high-quality data that have been reviewed for accuracy and consistency [71]. ARD can be used in various applications such as land cover mapping, change detection, and environmental monitoring. The concept and implementation of analysis readiness can significantly address both climate and disaster resilience needs for information agility by improving access to interdisciplinary sciences such as the natural, social, and applied sciences, as well as engineering (civil, mechanical, etc.), public health, public administration, and other domains of analysis and application.

The CEOS Analysis Ready Data (ARD) strategy aims to simplify data handling by removing many of the fundamental data correction and processing tasks from users so that more users and more uses of the data are possible [113]. CEOS ARD involves satellite data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets [1].

The Testbed 19 ARD ER reviews existing standards and previous ARD work, including CEOS ARD efforts [116][7][10][12][14][16][18][20][22][29][103][110][4] and previous OGC efforts [24][42][56]. The ER defines foundational elements that allow for the mixing and matching of different standards and targets the mission of implementing the Findable, Accessible, Interoperable, and Reusable (FAIR) principles for scalable and repeatable use of data [31]. ARD is a key example of the capability to enable the FAIR principles of finding, accessing, interoperating, and reusing physical, social, and applied science data easily.

The Testbed 19 activities included:

The ARD ER describes the scope, objectives, methodology, and expected outcomes of the Testbed 19 ARD work. The ER describes ARD requirements, identifies the initial use case objective(s), and identifies the components and elements needed to achieve those objectives. Further, the ER describes how the Testbed participants achieved the objectives and, where applicable, identifies technology gaps or elements for future work. Finally, this ER summarizes how this work can scale to other domains or be applied more broadly within the same domain.

2.  Analysis Ready Data

This section provides the basic concepts of and requirements for analysis ready data (ARD).

2.1.  Definition

CEOS defines ARD as “satellite data that have been processed to a minimum set of requirements and organized into a form that supports immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets.” This definition needs to be expanded to cover non-Earth Observation data, such as model outputs, in-situ measurements, demographic data, and economic data, which may need geocoding to register them geospatially. Broadening the definition to encompass all geospatial data, ARD is then defined as geospatial data that have been processed to a minimum set of requirements and organized into a form that enables immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets. The ultimate direction is working towards the FAIR principles [31] of finding, accessing, interoperating, and reusing physical, social, and applied science data easily.

2.2.  Fundamental Requirements

The fundamental requirements for making a dataset analysis ready are as follows; a minimal code sketch illustrating several of these steps is given after the list.

  1. Ensure Data Quality: The quality of data is critical in preparing the data for analysis. Data need to be accurate, consistent, complete, and free of errors. Thus, ensure that all the datasets being used are of high quality.

  2. Data Cleaning: The next step is to clean the data set. Data cleaning involves removing duplicates, filling in missing values, and removing any irrelevant variables from the dataset.

  3. Standardize Data Formats: Data come in different formats and types. Differences in coding or labeling of datasets may become a major problem during analysis. Thus, standardize the format of the data to make the data analysis-ready.

  4. Data Integration: Often, data for analysis come from multiple sources. In such a situation, different datasets might have varied column names, restrictions, clarifications, or even misalignments. So, it becomes essential to integrate the data sets into a single conflated and comprehensible dataset.

  5. Variable Identification: Knowing what each variable of a dataset represents is important. Proper documentation of each variable makes it easier to understand the dataset and improves the quality of the analysis.

  6. Data Segmentation: In addition to integrating datasets, segmenting or partitioning the data according to certain criteria or logic may be required when carrying out time-series analyses or testing hypotheses.

  7. Ensure Data Security and Privacy: Protecting data ensures continued access to, and the integrity of, valuable datasets for analysis. While the specific requirements may vary depending on the data characteristics and intended use, upholding a high degree of data security and privacy remains paramount.

  8. Data Storage: The final step is to store the data in a secure, but accessible manner. Best practices recommend storing data at a secure location, where the data are accessible to authorized users, with proper backup and disaster recovery provisions in place.
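The following minimal sketch, written in Python with pandas, illustrates several of the steps above (cleaning, unit standardization, integration, and segmentation). The file names, column names, and unit conversion are illustrative assumptions rather than a dataset used in this Testbed.

  import pandas as pd

  # Steps 2 and 3: load a source table, drop duplicates, fill short gaps,
  # and standardize units (illustrative Fahrenheit-to-Celsius conversion).
  obs = pd.read_csv("station_observations.csv", parse_dates=["timestamp"])
  obs = obs.drop_duplicates().sort_values("timestamp")
  obs["temperature_c"] = obs["temperature_f"].sub(32).mul(5 / 9)
  obs["temperature_c"] = obs["temperature_c"].interpolate(limit=3)

  # Step 4: integrate with a second source on a shared key into one conflated table.
  sites = pd.read_csv("site_registry.csv")            # site_id, latitude, longitude, ...
  ard = obs.merge(sites, on="site_id", how="inner")

  # Step 6: segment by year for time-series analysis.
  by_year = {year: group for year, group in ard.groupby(ard["timestamp"].dt.year)}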

2.3.  Product Families

2.3.1.  Earth Observations (Satellite remote sensing)

The major components of satellite remote sensing ARD typically include the following [1][110] (a minimal sketch of quality masking is given after the list):

  • radiometric and geometric correction;

  • mosaicing and tiling;

  • cloud and shadow masking;

  • atmospheric correction; and

  • metadata and catalog.
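As a concrete illustration of the cloud and shadow masking component, the following minimal Python sketch applies a quality-assurance (QA) band to retain only clear pixels. The file names and the bit position tested are illustrative assumptions; the actual QA bit layout differs between products and is documented by the data provider.

  import numpy as np
  import rasterio

  # Read one surface reflectance band and its per-pixel QA band (illustrative file names).
  with rasterio.open("surface_reflectance_band4.tif") as sr, \
       rasterio.open("pixel_qa.tif") as qa:
      reflectance = sr.read(1).astype("float32")
      qa_band = qa.read(1)

  CLEAR_BIT = 6  # assumed position of the "clear" flag; consult the product documentation
  clear = ((qa_band >> CLEAR_BIT) & 1) == 1
  reflectance[~clear] = np.nan  # mask out cloudy, shadowed, or otherwise unusable pixels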

In Testbed 19, several existing ARD products, originally produced for different analyses, were identified and combined into comprehensive datasets for focused analysis. These include Landsat data products for the gentrification scenario and essential Earth observations for the marine and coastal study scenario.

2.3.2.  Model Outputs

Model outputs have different properties and characteristics when being prepared and published as ARD. One example is synthetic data, which have a clear underlying physical model but simulate sensor observations under different conditions. The preparation and serving of synthetic results as ARD is one of the scenarios studied in Testbed 19.

2.3.3.  Other Geospatial Data

There still exist data that do not quite fall into the existing CEOS ARD families. For example, in-situ observations may need to be pre-processed through a series of algorithms to provide the observations as ARD ready for integration and interoperation with other data sources and analytical systems. Such in-situ data were studied in the coastal study scenario.

Demographic data and building information are other examples of non-EO data that need to be prepared and made analysis ready. Processing may involve geocoding and other specific pre-processing.

Besides “strict” geospatial data, there may be data that have certain geospatial properties but for which the focus is instead on making the data ready for analysis through AI/ML (artificial intelligence and machine learning). For example, training datasets may include broad ranges of labeled data for machine learning. The inclusion of such datasets in ARD is also discussed in this ER.

3.  Scenario on Gentrification Study

NOTE:    This scenario was led and implemented by the Center for Spatial Information Science and Systems (CSISS), an interdisciplinary research center chartered by the provost and affiliated with the College of Science at George Mason University, Fairfax VA, 22030, U.S.A.

3.1.  Introduction

Due to advancements in technology and acquisition strategies, institutions such as the USGS and the European Commission can release vast amounts of remote sensing images under free licenses. Simultaneously, storage and computing power have developed to accommodate this increase in data. Yet the volume of data still poses a great challenge to data analysts, scientists, and non-experts alike. Data analysts report spending 80% of their time cleaning data to ensure interoperability for time series [25]. Furthermore, much of the storage and computing power required to store and process these data is inaccessible to non-experts. Many scientists believe that the solution to both problems is a concept termed “analysis ready data” (ARD). The Committee on Earth Observation Satellites (CEOS) defines ARD as satellite data that have been processed to a minimum set of requirements and organized into a form that supports immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets [43]. In other words, an ARD product undergoes common data processing before distribution. These preparations are time consuming, computationally taxing, and require expertise to perform. Currently, the closest thing to a standard for ARD is CEOS Analysis Ready Data for Land (CARD4L), which outlines the threshold and target quality of data for it to be considered ARD [57].

The first set of data to meet the CARD4L requirements was the USGS’s Landsat Collection 2 surface reflectance and surface temperature products [43]. Collection 2 offers great depth of remote sensing imagery because the USGS also reprocessed its Collection 1 images from Landsat 1-8 [72], which enables users to build an ARD time series that spans back to the first Landsat mission. Collection 2 is organized into three levels [72]. Level 1 consists of geometrically and radiometrically corrected data [72]. Within Level 1, a tier system is used to distinguish data quality based on the radial root mean square error (RMSE) [72]. Tier 1 contains the best data, with an RMSE lower than 12 m; all data with an RMSE greater than 12 m are grouped into Tier 2 [72]. Near-real-time data are data that have yet to be categorized into either Tier 1 or Tier 2 and are available for rapid download [72]. Level 2 data are certified by CEOS as ARD; Level 2 products are derived only from Level 1 data and apply corrections to top-of-atmosphere measurements to provide surface reflectance and surface temperature [72]. Furthermore, all ARD products include a quality assurance band [72]. As of January 28th, 2022, in addition to Landsat, Sentinel-2’s Level-2A product has been certified by CEOS as meeting the ARD threshold [89]. The Sentinel-2 product is broken into several levels. Level-0 is the raw compressed image downlinked from the satellite [104]. Level-1A involves uncompressing the images and creating a geometric model to locate pixels in the image [104]. Level-1B involves performing radiometric calibrations and refining the geometric model produced in Level-1A [104]. Level-1C involves ortho-image generation, computation of top-of-atmosphere (TOA) reflectance, and cloud detection [104]. Finally, Level-2A provides surface reflectance products based on the TOA product as well as scene classification for clouds [104].

In short, the goal of ARD is to standardize and centralize data to make the data more accessible in a way that removes friction for users working with remote sensing data. CEOS created a guideline for what constitutes ARD in the CEOS ARD for Land (CARD4L) product, providing fundamental support for ARD standards [111]. Likewise, the Open Geospatial Consortium (OGC) Testbed 16 worked to solidify what ARD is by creating a list of characteristics for data to be considered ARD, including, but not limited to, homogeneous organization, georeferencing, units, and metadata detailing changes made [111]. This scenario was implemented to show, through a case study, how ARD helps improve the ease of use and accessibility of data. The scenario aims to provide insights and recommendations to the OGC Standards Working Group (SWG) responsible for advancing the ARD standards being developed jointly by the International Organization for Standardization (ISO) and OGC.

3.2.  Scenario Methodology

3.2.1.  Datasets

The Earth Explorer data portal provides access to the USGS’s Landsat Collection 2 data. Collection 2 consists of three levels. Level 1 consists of data that have been geometrically and radiometrically corrected. Level 1 uses an internal tier system to organize data based on pixel quality and processing level [114]. Real-time data is where imagery is held before being moved into either Tier 1 or Tier 2, which is determined by the image’s radial root mean square error (RMSE) [72]. An RMSE of 12 m or better is categorized as Tier 1. Level 2 data are ARD certified by CEOS and are derived from Tier 1 data only [117], [2]. Furthermore, there are Level 2 science products with a time series long and consistent enough that the products can be used to track climate change [5]. Lastly, Level 3 data are derived from Level 2 science products. Using Earth Explorer, ARD surface reflectance and associated quality assurance products were downloaded for the tile at horizontal 27, vertical 9, spanning 2013 to 2019, which encompasses the city of interest, Washington, DC.

While gentrification bears resemblances to community redevelopment, it usually progresses more rapidly and is frequently propelled by significant financial investments. Moreover, gentrification often brings about notable shifts in micro-level socioeconomic dynamics. To track gentrification, the Testbed participants focused on the construction of new buildings in DC. While ARD is available up until early 2023, the most recent building footprint of DC was from 2019. Like the DC building footprint, the building permit data were accessed from Open Data DC. The building permit data were organized by year and then filtered by the type ‘construction’ and subtype ‘new building’. Since construction takes several years, the analysis covered not only the year in which the permits were approved but also the years before and after. Working within these constraints, the earliest building permit used was from 2012 and the latest was from 2018.


Figure 1 — Example of model builders used to create training data

Because the Landsat imagery is ARD, there was no need to perform any top-of-atmosphere corrections, so the process of turning the Landsat imagery into training datasets could begin immediately. First, all data were added to an ArcGIS map. As this scenario explores the capabilities of ARD, using a common and easily accessible tool like ArcGIS meant that the methodology could be easily replicated and that conclusions about ARD’s limitations and abilities would be applicable to the largest number of people. Furthermore, as seen in Figure 1, ArcGIS’s built-in Model Builder expedited the process of creating training data by taking full advantage of the standardization of the ARD. After adding all the downloaded information, the building footprint was dissolved by shape area to remove shared boundaries. The dissolved building footprint was then used as the feature mask in the Extract by Mask geoprocessing tool to extract only pixels of DC from the Landsat imagery.

Then, an ArcGIS geoprocessing tool was used to remove all pixels that were not marked as clear pixels. All the data were reorganized into different folders by year, which were then grouped together in intervals of three years. As a result, there would be a folder 2013-2015, with subfolders 2013, 2014, and 2015, and likewise for 2014-2016, 2015-2017, 2016-2018, and 2017-2019. Next, true values were created by separating the pixels known to contain new buildings, using the Extract by Mask tool with the building permits for the respective year acting as the mask. As mentioned before, for each permit the participants also wanted to look at the year prior and the year after. For example, the 2013-2015 images used the 2014 building permit data to extract the pixels that had new buildings. Pixels designated as not having new buildings were stored separately in another folder. In short, this resulted in three groups of data: images with just the new-building pixels, images without the new-building pixels, and images with both types of pixels. The last set would be used as the unsupervised training data. Each group and set of years was organized into a separate mosaic dataset. In other words, there would be three 2013-2015 mosaic datasets, one for each group, and so on.

It was difficult to turn the mosaic dataset into a multidimensional dataset because ArcGIS was unable to read the ARD’s metadata, so information such as the product name and acquisition date was missing. However, the ARD file names are formatted to be human-readable [104] and include the acquisition date as part of the file name. Therefore, creating a new field in the mosaic dataset and parsing the file name with the Calculate Field geoprocessing tool provided the temporal component needed to create a multidimensional raster with the Build Multidimensional Info geoprocessing tool. Finally, the space time cubes were created using the Create Space Time Cube From Multidimensional Raster spatial analysis tool. One advantage of using this tool was the fill empty bins parameter, which used an interpolated univariate spline algorithm to create a temporal trend to fill the empty bins.
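Because the acquisition date had to be recovered from the file name, the following minimal Python sketch shows that step outside of ArcGIS. The example file name follows the human-readable pattern referenced above but is fabricated for illustration.

  from datetime import datetime

  name = "LC08_CU_027009_20170614_20210502_02_SR_B4.TIF"  # illustrative ARD file name

  parts = name.split("_")
  satellite = parts[0]                                    # e.g., "LC08" for Landsat 8
  tile = parts[2]                                         # horizontal 027, vertical 009
  acquisition_date = datetime.strptime(parts[3], "%Y%m%d")

  print(satellite, tile, acquisition_date.date())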

Using the datasets created from the ARD, other built-in machine learning tools within ArcGIS were used and tested. With the release of ArcGIS 3.1, ‘Train Using AutoML’ and ‘Predict Using AutoML’ became available. Both tools enable the user to streamline the machine learning process by determining the best models and hyperparameters and creating the optimal ensemble of models for the validation set. The ‘Train Using AutoML’ tool was therefore used to build a machine learning model optimized for the given dataset, and the ‘Predict Using AutoML’ tool was used to predict where gentrified pixels would appear in the future.

3.2.2.  Supervised Machine Learning

A feature class was needed to use the ‘Train Using AutoML’ tool, so the rasters had to be converted into polygons. Since the goal was for the model to predict future gentrified pixels, the dataset of rasters containing only gentrified pixels was used; this dataset had previously been used as the input for the multidimensional rasters. Each raster was converted to an individual polygon feature using the ‘Raster to Polygon’ tool. The polygon feature stored each gentrified pixel as its own row, so each pixel was assigned a date and satellite based on the raster it came from. Each polygon feature was then added to a larger feature class grouped into the years 2013-2015, 2014-2016, 2015-2017, 2016-2018, and 2017-2019, mirroring how the space time cubes were organized. All these steps were automated using the Model Builder feature of ArcGIS, which was made possible by the ARD’s standardized file names and pre-processing. Finally, the feature classes could be used as input to the ‘Train Using AutoML’ tool, with the grid code as the dependent variable the model was trying to predict and the date and satellite as the independent variables used to explain that change over time. The models created could then be used by ‘Predict Using AutoML’.
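Since the internals of ‘Train Using AutoML’ are not part of any open standard, the following minimal Python sketch uses scikit-learn as a stand-in to show the same supervised setup: the grid code as the dependent variable and the acquisition date and satellite as independent variables. The table contents are synthetic placeholders, not Testbed data.

  import pandas as pd
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for the feature class: one row per pixel-polygon.
  features = pd.DataFrame({
      "acq_year":  [2013, 2014, 2015, 2014, 2015, 2013],
      "acq_doy":   [160, 180, 200, 150, 170, 190],   # day of year of acquisition
      "satellite": [8, 8, 7, 8, 7, 8],
      "gridcode":  [1, 1, 0, 0, 1, 0],               # 1 = new-building (gentrified) pixel
  })

  X = features[["acq_year", "acq_doy", "satellite"]]
  y = features["gridcode"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

  model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
  print(model.score(X_test, y_test))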

3.3.  Results and Discussions

In this study, the strengths and limitations of using Landsat ARD were evaluated for the workflow of creating training data, and the training data were used to evaluate the AutoML tools and the time series clustering tool built into ArcGIS. Overall, the evaluation of the various pre-defined machine learning tools in ArcGIS shows that standard machine learning algorithms are readily applicable to use cases such as monitoring gentrification. Regarding how ARD affects the workflow, the Landsat ARD was easily accessible due to the USGS Earth Explorer website and the ARD tiling scheme. Earth Explorer maintains accessibility by providing free, indexable, and easily downloadable data. Furthermore, the standardized radiometric and geometric correction allows for immediate interoperability and comparison in time series. The benefits of this are threefold. First, accessibility to remote sensing data is increased because the user does not need to be familiar with applying radiometric and geometric correction algorithms to use the images. Second, this reduces both the labor hours and the computing power that would have been needed to apply those corrections. Lastly, ARD maximizes accessibility by being processed enough that it can be used immediately upon download, but not so specifically that it can only be used in a narrow domain. This is exemplified by the creation of Esri’s Space Time Cubes, which are similar to Data Cubes. Immediately after adding the data to ArcGIS, the images could be altered to fit the needs of the use case. Furthermore, the built-in quality assurance (QA) band that comes with each raster removed poor-quality pixels from the dataset.

However, there were some minor limitations when working with ARD. First, the QA bands for the different satellites used different encodings to represent pixel quality, meaning that two model builders had to be used to process the data. Another area for improvement was metadata incompatibility with ArcGIS, which supports the specific Landsat satellites but not the ARD products; it is hoped that Esri will address this in the near future. Lastly, it is recommended that users be able to access ARD through a Data Cube-like format, especially in the described scenario, which turns ARD into machine-learning-ready training datasets. Currently, thanks to the standard human-readable file and raster names, the acquisition date, band information, and satellite used could easily be added to datasets that had trouble reading the ARD metadata, but serving ARD through a Data Cube would provide a much smoother user experience. This recommendation was also made in the Climate Resilience Pilot Engineering Report. Analysis Ready Data Cubes (ARDC) play a crucial role in efficiently processing big data, making ARD more accessible and user-friendly [8]. The development of ARDC will not only enhance the practical application of ARD for Earth Observation (EO) data but will also facilitate the integration of non-EO datasets, fostering the development of applications that address real-world problems by seamlessly combining EO and non-EO data while adhering to ARD standards.

In conclusion, notwithstanding ARD’s minor limitations, ARD significantly optimized the workflow process of turning the downloaded data into organized machine learning training datasets due to its accessibility and immediate interoperability.

4.  Scenario on ISO/OGC Coverage and Datacube Standards

NOTE:    The scenario on the topic of reviewing ISO/OGC coverage and datacube standards for Analysis Ready Data was led and implemented by rasdaman.

4.1.  Introduction

This analysis investigates how analysis-ready the OGC Coverage Implementation Schema (CIS) is [26][44][58]. Coverages are the accepted paradigm for modeling fields (in the sense of physics) across standards bodies with a geospatial focus [107]. Technically speaking, coverages encompass regular and irregular grids, point clouds, and general meshes. The gridded data, specifically, resemble datacubes, which are the particular focus of this Engineering Report.

One use case to investigate (with the help of the GeoDataCube (GDC) work that was set up in parallel to this scenario activity) is how far the ISO/OGC coverage standards carry in supporting the analysis readiness of geospatial data, in particular the CIS 1.1 [75] and OGC Web Coverage Processing Service (WCPS) 1.1 [92] Standards.

4.2.  Coverages — A Data Structure for ARD in Earth Observation

This section summarizes the key findings and recommendations regarding the use of coverages as a data structure for Analysis Ready Data (ARD) in Earth Observation (EO). For detailed information, please refer to Annex B.

4.2.1.  Standards and Structure

  • Standards Alignment: Coverages adhere to OGC Standards such as CIS 1.1 and WCPS, aligning with ISO 19123-2 and 19123-3 for consistency and interoperability.

  • General Grid Coverage: This core structure defines the spatial and data aspects of ARD, consisting of the following.

    • Domain set: Geospatial reference system

    • Range set: Data values and their types

    • Range type: Data format (e.g., numerical, categorical)

    • Metadata: Additional information about the data

4.2.2.  WCPS — A Datacube Language for ARD Processing

  • Datacube Model: WCPS provides a framework for organizing and manipulating large EO datasets.

  • Common Operations: WCPS offers a set of functions for processing and analyzing coverages.

  • User-Friendly Syntax: Similar to the XQuery FLWOR idiom, WCPS enables users to express processing tasks intuitively (see the sketch after this list).
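The sketch below shows what submitting a WCPS expression to a coverage server might look like over HTTP, using the WCS Processing Extension binding. The endpoint URL, coverage name, and band names are illustrative placeholders, not services or products deployed in this Testbed.

  import requests

  ENDPOINT = "https://example.org/rasdaman/ows"  # hypothetical WCS/WCPS endpoint

  # WCPS expression: per-pixel NDVI from the red and near-infrared bands of an
  # illustrative coverage, encoded as GeoTIFF.
  wcps_query = """
  for c in (S2_L2A_EXAMPLE)
  return encode(
      (c.nir - c.red) / (c.nir + c.red + 0.0000001),
      "image/tiff"
  )
  """

  response = requests.get(
      ENDPOINT,
      params={"service": "WCS", "version": "2.0.1",
              "request": "ProcessCoverages", "query": wcps_query},
      timeout=60,
  )
  response.raise_for_status()
  with open("ndvi.tif", "wb") as f:
      f.write(response.content)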

4.2.3.  Obstacles and Recommendations for Coverages as ARD

This study analyzes challenges in using coverages for ARD and proposes solutions.

1. Data Modeling

  • Pixel-in-X Misconception: Clarify that pixels are associated with specific coordinates, not cells. Invest in educational resources to address this confusion.

  • Pixel Interpretation: Standardize on “pixel-is-area” for consistency and to avoid half-pixel shifts (see the sketch after this list).

  • Units of Measure: Adopt QUDT for its machine-readable format and conversion capabilities.

  • Tiling Transparency: Make tiling an internal detail of ARD, transparent to users.

  • Structured Metadata: Organize and structure metadata for improved access and comprehension. Consider a registry of defined extensions for efficient information extraction.
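The half-pixel issue mentioned under Pixel Interpretation can be made concrete with the following minimal Python sketch; the origin, resolution, and axis orientation are illustrative values only.

  import numpy as np

  origin_x, origin_y = 500000.0, 4600000.0  # grid origin in a projected CRS (metres)
  pixel_size = 30.0                         # illustrative 30 m resolution

  def center_pixel_is_area(col, row):
      """Pixel-is-area: the origin marks the outer corner; centers sit half a pixel inward."""
      return (origin_x + (col + 0.5) * pixel_size,
              origin_y - (row + 0.5) * pixel_size)

  def center_pixel_is_point(col, row):
      """Pixel-is-point: the origin already refers to the center of pixel (0, 0)."""
      return (origin_x + col * pixel_size,
              origin_y - row * pixel_size)

  # The two conventions disagree by exactly half a pixel (15 m here); mixing them
  # across datasets introduces a systematic misregistration.
  print(np.subtract(center_pixel_is_area(0, 0), center_pixel_is_point(0, 0)))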

2. Data Processing

  • Context-Aware Interpolation: Utilize appropriate interpolation methods such as kriging based on data type and context.

  • Compatible Image Pyramids: Allow only compatible interpolation methods during retrieval and processing to avoid inconsistencies.

  • Data Summarization: Clearly document appropriate aggregation methods for different data types (e.g., counts vs. averages) to prevent misinterpretations.

  • Dimension Hierarchies: Capture and document hierarchical structures (e.g., time series) for efficient analysis and exploration.

  • Validity and Reliability Masks: Implement masks to identify and filter out areas of uncertainty, improving data reliability.

  • Product Provisioning Coherence: Track data processing history and ensure consistency across versions and providers to maintain data quality.

  • Numerical Effects Awareness: Understand the inherent inaccuracies of floating-point numbers to avoid calculation errors (a small example follows this list).
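The following minimal Python example illustrates the numerical effect referred to in the last item: naive accumulation can silently lose small contributions, while compensated summation preserves them. The values are synthetic.

  import math

  values = [1e16, 1.0, -1e16]

  print(sum(values))        # 0.0: the 1.0 is absorbed by rounding at magnitude 1e16
  print(math.fsum(values))  # 1.0: compensated summation keeps the small contribution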

4.2.4.  Practical Examples in Context

This study examined how coverages can be applied to real-world scenarios.

  • Service Quality Parameters: Define and communicate key parameters such as accuracy, resolution, and uncertainty to users for informed decision-making.

  • Coverage Fusion: Leverage advanced techniques and cloud computing to combine data from diverse sources despite format variations, quality differences, and spatial/temporal overlaps.

  • Machine Learning Integration: Utilize ML models for automated ARD processing and analysis, ensuring data quality and model suitability for specific tasks.

4.3.  Recommendations for Standard Development

Based on the insights gained, the following are recommendations for improving ARD standards.

  • Refine Existing Standards: Update OGC standards to address identified challenges and reflect current needs in EO.

  • Enhance Metadata Structures: Design and implement standardized metadata structures to accommodate diverse use cases and scenarios.

  • Universal Units of Measure: Promote QUDT adoption for consistent and interoperable data exchange.

  • User-Friendly APIs: Focus on clear data access and processing functionalities in APIs, hiding technical details.

  • Interval Arithmetic Adoption: Utilize interval arithmetic to quantify uncertainties and provide more reliable calculation results.

  • Fitness Negotiation and SLAs: Develop mechanisms for users to specify quality requirements and services for guaranteed data suitability.

  • Model Applicability Parameters: Define clear parameters for machine learning models to ensure appropriate use and reliable results.

4.4.  Conclusion

By addressing the challenges and implementing the proposed recommendations, coverages can become a powerful and versatile data structure for ARD in Earth Observation, enabling efficient and accurate analysis for diverse applications and advancing scientific research and decision-making in EO.

5.  Synthetic Data Scenario

NOTE:    This scenario was led and implemented by Rendered.AI.

5.1.  Introduction

One of the fundamental reasons to define Analysis Ready Data standards is that it is common for real world datasets to have insufficient metadata or structure for common tasks and analyses. Synthetic data generation, the process of creating datasets that simulate real data according to predefined specifications, offers an opportunity to advance the concept and application of ARD by the following.

  1. Providing benchmark or referenceable examples of how an ideal dataset would be composed including content, metadata, and structure.

  2. Supporting specific use cases or examples of commonly used datasets, such as Earth observations satellite content, that can be used to test data processing tools and pipelines.

  3. Providing experimental input for both training and validating algorithms used to process real sensor data.

This Testbed 19 ARD project provided a demonstration of a synthetic data generation pipeline that produces diverse datasets in an ARD-compliant format, specifically the CEOS ARD for Land – Surface Reflectance (CARD4L-SR) Standard. The goal of this process is to better understand how synthetic data can provide value to the creators and users of ARD, and how the framework for ARD can likewise benefit the creators and users of synthetic data. In the process, the implications of synthetic data generation within this ARD framework are explored as well as what elements of an ARD specification might be beneficial for supporting synthetic data and its uses.

5.2.  Methodology

The synthetic data application produced for this project supports the simulation of Analysis Ready Data for electro-optical remote sensing imagery. The data generation pipeline utilizes industry-leading, physics-based image simulation technology that enables the creation of sensor model approximations of existing Earth imaging platforms currently in orbit. The application also leverages remote-sensing-derived content that has been assembled to produce “digital twins” of locations on Earth at various scales. This content simulation capability was then configured into a synthetic data channel on the Rendered.ai platform that outputs all dataset- and pixel-level truth information necessary to meet the threshold requirements of the CARD4L-SR specification.

5.2.1.  Sensor Simulation

The simulation capability demonstrated in this effort uses DIRSIG, the simulation technology of the Rochester Institute of Technology’s Digital Imaging and Remote Sensing (DIRS) Laboratory. DIRSIG enables the simulation of physically accurate electro-optical data, with accurate spectral properties and radiometric responses calculated at sub-pixel resolutions. This is achieved using detailed models of sensor and platform properties that drive a path-traced radiometry estimation performed against provided, spectrally defined 3D content.

The image capture platforms chosen to be modeled for this application were:

Table 1 — Simulated imaging platforms

Platform Type | Sensor Approximation | GSD | Spectral Bands | Array Size
Medium resolution EO | Maxar WorldView-3 | ~1.24 m | 9-channel VIS+NIR | 640 x 480
High resolution EO | Planet SkySat 16-21 | ~0.75 m | 5-channel PAN+VIS+NIR | 1024 x 768

These platforms represent common data sources for users in the remote sensing and ARD community.

5.2.2.  Scenes

Simulation scenes were selected from a set of available radiometrically annotated scenes offering a variety of geospatial content that can be used to provide product family information. Scenes in DIRSIG are defined using the following.

  1. 3D content, including terrain surface model and specific models of above-ground assets

  2. Material maps that define material type for all surfaces in the scene

  3. Material emissivity curves for all material types referenced

  4. Texture maps that associate varied material curves within each material type

The scenes selected for this application are as follows.

  • Suburban scene: This scene represents an 8 km² area modeled after the Rochester suburb of Irondequoit, NY.

  • Industrial scene: This scene represents a 10 km² area modeled after a chemical plant in the desert town of Trona, CA.


Figure 2 — RAI Scene Side By Side (Left: Overhead image of the Suburban scene; Right: Overhead image of the Industrial scene.)

These scenes were constructed by researchers and engineers at RIT’s DIRS Laboratory and represent high-fidelity 3D geometry and spectra purpose-built for simulation within DIRSIG.

5.2.3.  Atmospherics

The capability to modify atmospheric conditions and visibility is included with the application. This is achieved using atmosphere models generated using MODTRAN spectral modeling software. Separate atmospheric models were generated for urban and rural aerosol levels, as well as summer and winter conditions found at mid-latitude. Within these four categories of atmosphere, five different visibility levels were also modeled, including 5, 10, 15, 30, and 50 km visibility. These variations put the total number of atmospheric combinations at twenty, allowing for a flexible determination of atmospheric properties.

Importantly, atmospheres can also be removed from the simulation to approximate the desired output of imagery post-processed for atmosphere removal. By default, when an atmosphere is selected in the simulation configuration, this channel outputs two images per run of the simulation: one with the atmosphere included and one without. This design was chosen to support the use case of atmospheric removal process development, testing, and validation.

Clouds and cloud shadows are also modeled within this application. Cloud modeling uses a voxel-based approach in which individual voxels contain information about water vapor concentration, which is used to approximate the absorption and scattering of various wavelengths of light.

5.2.4.  Synthetic Data Application

The synthetic data application is developed using the “Channel” implementation based upon Rendered.ai’s open source Ana framework. The channel is then deployed to the Rendered.ai platform, allowing users to generate synthetic datasets on-demand using the web-based dataset configuration interface. This graph interface supports the explicit definition of which parameters to control and which to randomize to produce the desired diversity in the output dataset.


Figure 3 — RAI ARD Graph (Node and edge-based graph configures simulation inputs on the Rendered.ai platform.)

This channel is configured to output imagery with all information required to meet the threshold ARD requirements including dataset and pixel-level metadata. The output of running a simulation using this channel is a zipped folder containing output data cubes for each run of the simulation, JSON files containing dataset-level metadata, and pixel mask images that show pixel-level metadata designated in the ARD specification.

The synthetic data application developed for this project was deployed to the Rendered.ai platform, and can be utilized by the general public by using a content code within the web platform. The content code specific to this application is “ARD.” Within the workspace included in this content code, users will find pre-configured graphs for specific image scenario simulations, as well as pre-generated datasets that include all required metadata and annotations specific to the CARD4L-SR standard.

5.3.  Discussion

5.3.1.  Applications of Synthetic Data for the ARD Community

Synthetic data has many potential applications for ARD practitioners. The synthetic data channel developed for this project enabled creators of ARD data products to produce baseline datasets to develop and test algorithms for automated generation of ARD calibrations and metadata. This includes atmospheric calibration, water and ice pixel masking, cloud and cloud shadow detection, and terrain occlusion and shadowing. As these processes often require significant amounts of ground truth data to develop, the use of synthetic data can alleviate the need for expensive data collection and labeling campaigns.

Synthetic data also has uses for consumers of ARD datasets, supporting the development and testing of custom processing techniques with known land cover, sensor, and atmospheric inputs. The configurability of the data output enables fine-grained experimentation to understand the impacts that varied conditions and collection parameters have on algorithm performance. Also, because the scene and collection parameters can be fully customized, synthetic data allows users to approximate data collected from imaging platforms that do not yet exist, or of objects on Earth that have never been captured in imagery, thereby reducing barriers to innovation.

5.3.2.  Synthetic Data and the ARD Standard

The CEOS ARD standard was developed to provide users of real Earth Observation data with all the information needed to perform common and complex analytics with those data. It is also a helpful guide for developers and users of synthetic EO data, as the standard lays out guidelines for associated descriptive data. Often, annotations and metadata constructed for synthetic data are customized to include only the information relevant to the task the data were engineered for, but as synthetic data become more widely adopted, dataset structure and metadata will need to become more standardized to be more widely usable. This effort serves as a step in that direction for synthetic EO imagery.

While the CEOS ARD standard was designed to apply to real EO datasets, much of the relevant dataset-level metadata of synthetic datasets generated in this exercise can be incorporated into the existing standard. For instance, the requirement to specify Auxiliary Data (section 1.14) can apply to all input content used in simulation, for example the 3D content and atmospheric databases used. Similarly, the requirement for all algorithms used in dataset generation to be listed (section 1.13) can apply to simulation algorithms used in image generation, for instance the use of DIRSIG for this simulation application. Due to the volume and granularity of this information, a distributed metadata approach, such as separate per-image metadata JSON files like those created by the Rendered.ai platform, is a preferred format for the exchange and transfer of this information.
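A minimal Python sketch of how such distributed, per-image metadata could be checked against a list of expected dataset-level fields is shown below. The file layout and key names are illustrative assumptions, not the actual Rendered.ai output schema or the CARD4L-SR requirement identifiers.

  import json
  from pathlib import Path

  REQUIRED_KEYS = {
      "acquisition_time",       # when the (simulated) observation was taken
      "sensor_model",           # which platform approximation produced the image
      "auxiliary_data",         # 3D content, atmosphere database, etc.
      "processing_algorithms",  # simulation and processing steps applied
  }

  def check_metadata(folder: str) -> None:
      for meta_file in sorted(Path(folder).glob("*-metadata.json")):
          record = json.loads(meta_file.read_text())
          missing = REQUIRED_KEYS - set(record)
          status = "ok" if not missing else f"missing {sorted(missing)}"
          print(f"{meta_file.name}: {status}")

  check_metadata("./ard_synthetic_dataset")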

The diversity needed for effective synthetic data requires the stochastic variation of input parameters to ensure sufficient domain coverage. To ensure data provenance, these stochastic inputs need to be traceable in image-level metadata. This is one area where synthetic data require metadata beyond what is defined in the existing CEOS ARD standard. By default, the Rendered.ai platform creates this level of metadata per run of a simulation. If synthetic data were to be considered as a novel family within the ARD standard, this would be an important element to include.

5.3.3.  Lessons Learned and Next Steps

The work done for this project showed that a synthetic EO data pipeline can be developed that can auto-generate data that meet the requirements of a CEOS Analysis Ready Data standard. This standard is already well-defined for accommodating synthetic data, though there are areas where additional requirements may be specified to ensure the most useful output synthetic data products. With synthetic data growing in importance in the realm of AI and data analytics, it is important for this form of data to be considered in any standards development effort.

With this example established, further work could be done to develop Analysis Ready synthetic data with a specific use case in mind, or to supplement an existing real Analysis Ready dataset to address issues of bias or data scarcity in the real data. Beyond this, further work could be done to develop a formal substandard within ARD to specifically support synthetic data needs, including requirements discussed in this report surrounding stochastic simulation input information to ensure fully traceable data products and outcomes.

5.4.  Conclusion

This effort serves to introduce the concept of synthetic data within the ARD community. To that end, the application developed as part of this effort will be included as a “Content Code” in the Rendered.ai platform, allowing new users to utilize the content and capabilities described in this report through a complimentary thirty-day trial of the Rendered.ai platform that allows simulation configuration and unlimited dataset generation within that period. From there, experiments can be run using the functionality and metadata described, including atmosphere removal processing, cloud detection, and detection of various land cover elements in varied scenarios.

Hopefully this effort will introduce users in the ARD community to the concept of synthetic data for EO applications. As the potential value of synthetic data is realized within this community, thoughtful planning will be needed around incorporating synthetic data techniques into the demonstration and validation of Analysis Ready Data standards.

6.  Study of Coastal Environments in the Arctic

NOTE:    This scenario was led and implemented by Pelagis Data Solutions. Pelagis is an ocean-tech venture located in Nova Scotia, Canada focused on the application of open geospatial technology and standards designed to promote the sustainable use of our ocean resources.

6.1.  Introduction

Remote sensing of marine and coastal environments plays an increasingly important role in monitoring the sustainable use of ocean resources. As the effects of climate change are especially impactful to coastal ecosystems, Earth Observation (EO) derived analysis ready datasets, corrected for environmental bias and for spatial and temporal resolution, provide valuable insights into coastal areas that are otherwise very difficult, if not impossible, to monitor, supporting tasks such as mapping habitat extent and change, understanding biogeochemical processes, and monitoring human impacts and conservation efforts.

This Testbed 19 project was designed to enhance previous work positioning the OGC suite of standards and best practices at the core of a federated marine spatial data infrastructure (MSDI). In particular, analysis ready datasets for the marine and marine-terrestrial realms, as defined by the International Union for Conservation of Nature (IUCN) Global Ecosystem Typology, are reviewed and applied towards the development of essential climate and biodiversity variables for coastal marine environments in the Canadian Arctic.

6.2.  Challenge

The concept of analysis ready data has historically been targeted towards satellite-derived datasets processed to a minimum set of requirements and organized into a form that enables immediate analysis with a minimum of additional user effort and that is interoperable through space and time. Although the term “analysis readiness” appears relatively generic, in practice analysis ready datasets must adhere to a minimum threshold of requirements. These requirements include defining the characteristics of the dataset, the per-pixel properties and capabilities, and the metadata describing the atmospheric and geometric corrections applied to dataset observations.

A key benefit of analysis readiness is that it hides the complexities of data collection and processing of raw satellite imagery and provides application ready datasets targeting specific scenarios. In terms of interoperability of these datasets, the key characteristics when applied to geospatial applications are the spatial and temporal properties of the dataset. Information on these characteristics permits client applications (and users) to determine suitability of such datasets applied to specific domain problems over a specific temporal range and spatial extent.

Similarly, OGC provides Standards and specifications that address collections of observations provisioned through in-situ platforms and sampling programs. The OGC Abstract Specification Topic 20: Observations, measurements, and samples version 3.0 (OGC OMSv3)[60] models collections of observations associated with the properties of a feature of interest. Observations are modeled as collections over an observed property and allow for subsequent processing to derive ‘analysis readiness’. In this context, it is important to understand the overlap of term definitions to further ensure the interoperability of analysis ready datasets independent of the platform from which the datasets were derived. This separation of concerns is well addressed by the OMSv3 specification.

The following scenario revisits the role of analysis ready datasets within a regionally applied climate monitoring system. The scenario was designed to leverage analysis ready datasets combined with in-situ observations to draw direct relationships between a changing environment and dependent human activities. The core of this exercise focuses on the application of OGC Standards and specifications as adapters to provision analysis ready datasets relative to key ocean and coastal climate indicators. The usability of satellite-derived observations is dependent on key processing algorithms that transform the raw observation collections into key environmental indicators that either directly measure essential variables associated with a region of interest or indirectly contribute to further processing towards similar goals.

6.2.1.  Analysis Ready Datasets for a Digital Arctic

The Global Climate Observing System (GCOS) defines a set of Essential Climate Variables (ECVs) representing key variables that contribute to the characterization of Earth’s climate. In particular, sea ice is a key indicator of climate variability in the polar regions. Three key components representative of sea ice variability are sea ice concentration, sea ice thickness, and surface albedo. Sea ice is defined as frozen sea water which floats on the surface of the ocean, excluding ice shelves which are anchored on land but protrude out over the surface of the ocean. Long-term monitoring of sea ice is important for understanding climate change and the related impact on regional biodiversity and ecosystem services. The sea ice - surface albedo relationship is a key component of climate monitoring. A decrease in sea ice coverage directly affects surface albedo with a corresponding increase in solar heating of ocean waters.

To support the integrity of climate observations, GCOS identifies the measurable parameters to be used to characterize each ECV, for example sea ice concentration (coverage), sea ice surface albedo, and sea ice thickness. The requirements for each measured parameter are similar to the CEOS-defined Product Family Specification (PFS) schema in that GCOS defines five criteria to be used to assess the quality of measurement: spatial resolution (horizontal, vertical), temporal resolution, measurement uncertainty, stability (i.e., effects of bias over time), and timeliness (how often the phenomenon is measured and made available, e.g., daily, monthly).

For each criterion, there is a set of guidelines that must be met to support the application of any measurement to the ECV: a goal (G) is the ideal requirement to be met by an observation collection; a threshold (T) represents the minimum acceptable value; and a breakthrough (B) value represents an intermediate level between the goal and the threshold, identifying the limits of applying such measurements to specific use cases. For example, to analyze sea ice concentration for near-coast applications the goal is set to a horizontal resolution of 1 km, whereas regional applications are limited to 5 km resolution.
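A minimal Python sketch of classifying a measured horizontal resolution against goal/breakthrough/threshold levels follows; the 1 km goal and 5 km regional limit come from the example above, while the breakthrough value is an illustrative assumption rather than a GCOS figure.

  def classify_resolution(resolution_km: float,
                          goal: float = 1.0,
                          breakthrough: float = 2.5,
                          threshold: float = 5.0) -> str:
      """Smaller values are better (finer grids); levels are defined per parameter."""
      if resolution_km <= goal:
          return "meets goal (suitable for near-coast applications)"
      if resolution_km <= breakthrough:
          return "meets breakthrough"
      if resolution_km <= threshold:
          return "meets threshold (regional applications only)"
      return "does not meet the minimum requirement"

  print(classify_resolution(2.5))  # e.g., a 2.5 km regional grid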

This work item complements the deliverables associated with the Digital Arctic theme of the OGC FMSDI 2023 project. In particular, the goal is to leverage satellite derived datasets to identify the changes in sea ice coverage, thickness, and surface albedo related to features of interest within the circumpolar Arctic.

There are several data sources available that provide measurements of sea ice coverage and surface albedo for the Arctic region. Current work has focused on integration with data products provisioned through the Copernicus Climate Data Store (CDS) and NASA’s National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC).

Processed observations of sea ice coverage and surface albedo provided through the Copernicus Climate Data Store are made available through the ERA5 reanalysis product, gridded to a regular latitude-longitude grid of 0.25 degrees. In addition, there is the Copernicus Arctic Regional Reanalysis (CARRA) data product, gridded to a regional resolution of 2.5 km. For the Testbed 19 exercise, the CARRA-WEST dataset was used as the baseline for monitoring the regional climate indicators for sea ice coverage and surface albedo.

The NSIDC DAAC provides the ICESat-2 data collection derived from the Advanced Topographic Laser Altimeter System (ATLAS) instrument aboard the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2). The NSIDC DAAC distributes Level-1, Level-2, Level-3A, and Level-3B ICESat-2/ATLAS products, which range in temporal coverage from October 2018 to the present. These datasets are based on the polar orbit of the ICESat-2 satellite, separating each ground track revisit into separate data granules. There are 1,387 reference ground tracks (RGTs) in the ICESat-2 repeat orbit. The reference ground track increments each time the spacecraft completes a full orbit of the Earth and resets to 1 each time the spacecraft completes a full cycle. Metadata specific to each polar orbit are maintained as a queryable and discoverable service identifying specific data granules based on spatial and temporal extents. A spatial query provides references to the available data granules with RGTs that intersect the extent of a feature of interest. Once established, temporal queries are available to isolate the set of data granules representing each ground track revisit. For the purpose of the Testbed 19 exercise, the focus was on the data product [74] providing along-track heights for sea ice and open water leads.
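The following minimal Python sketch illustrates granule discovery by spatial and temporal extent through NASA's Common Metadata Repository (CMR) granule search, which fronts NSIDC DAAC holdings. The product short name and extents are illustrative; substitute the product and feature-of-interest extent used in the exercise.

  import requests

  CMR_GRANULES = "https://cmr.earthdata.nasa.gov/search/granules.json"

  params = {
      "short_name": "ATL07",                    # illustrative ICESat-2 sea ice product
      "bounding_box": "-70.0,68.0,-63.0,71.5",  # roughly eastern Baffin Island (W,S,E,N)
      "temporal": "2019-05-01T00:00:00Z,2023-04-30T23:59:59Z",
      "page_size": 25,
  }

  response = requests.get(CMR_GRANULES, params=params, timeout=60)
  response.raise_for_status()
  for granule in response.json()["feed"].get("entry", []):
      print(granule["title"])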

6.3.  Approach

This exercise leverages the emerging role of the OGC GeoDataCube initiative to transform the native formats provided through the CDS and NSIDC data stores into a rasdaman array database, exposing the native data stores as “analysis ready.” For context, the area of concern was established around the protected areas of the Canadian Arctic, specifically the Ninginganiq National Wildlife Area located on the east coast of Baffin Island, Nunavut.

Temporal analysis of Surface Albedo

The inherent value of the GeoDataCube (GDC) framework is its ability to scale both in terms of volume of data and processing capabilities. This use case focuses on delegating the analysis of surface albedo to the GDC service provider over a temporal extent to determine the magnitude of change for each gridded observation. In this case, the GDC provider translates the request into a native OGC Web Coverage Processing Service (WCPS) process graph for execution by the GDC service instance to determine the change in surface albedo for the region of interest over the temporal range of May 2019 to April 2023 (see Figure 4).
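The kind of WCPS expression such a process graph might resolve to is sketched below in Python string form, so that it could be submitted through the same ProcessCoverages binding shown earlier in Clause 4. The coverage name, time axis label, and dates are illustrative placeholders, not the actual CARRA-WEST coverage identifiers.

  # Difference in surface albedo between two time slices near the ends of the
  # May 2019 to April 2023 range (illustrative coverage and axis names).
  albedo_change_query = """
  for c in (CARRA_WEST_ALBEDO)
  return encode(
      c[ansi("2023-04-15")] - c[ansi("2019-05-15")],
      "image/tiff"
  )
  """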