OGC Engineering Report

Testbed-18: Machine Learning Training Data ER
Sam Lavender (Editor), Kate Williams (Editor), Caitlin Adams (Editor), Ivana Ivánová (Editor)
OGC Engineering Report


Document number: 22-017
Document type: OGC Engineering Report
Document subtype:
Document stage: Published
Document language: English

License Agreement

Use of this document is subject to the license agreement at

I.  Abstract

This OGC Testbed 18 Engineering Report (ER) documents work to develop a foundation for future standardization of Training Datasets (TDS) for Earth Observation (EO) applications. The work performed in the Testbed 18 activity builds on previous OGC Machine Learning (ML) activities. TDSs are essential to ML models, supporting accurate predictions when performing the desired task. However, a historical absence of standards has resulted in inconsistent and heterogeneous TDSs with limited discoverability and interoperability. Therefore, there is a need for best practices and guidelines for generating, structuring, describing, and curating TDSs, including the development of example software/packages to support these activities. Community and parallel OGC activities are working on these topics. This ER reviews those activities while making recommendations.

II.  Executive Summary

This OGC Testbed-18 ER begins by providing an introduction to Artificial Intelligence/ML in the context of EO. The introduction is followed by a review of existing approaches to creating and storing TDSs. Then, TDS formats are reviewed in terms of metadata, creating a catalog, expressing quality, and adherence to Findability, Accessibility, Interoperability, and Reuse (FAIR) principles. Finally, the summary reviews the next steps, best practice ideas, and the geoethics of generating and distributing training data.

III.  Keywords

The following are keywords to be used by search engines and document catalogues.

Artificial Intelligence, Earth Observation, Machine Learning, Training Dataset

IV.  Preface

IV.A.  Foreword

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.

Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.

V.  Security considerations

No security considerations have been made for this document.

VI.  Submitters

All questions regarding this submission should be directed to the editor or the submitters:

Sam Lavender, Pixalytics Ltd, Editor
Kate Williams, FrontierSI, Editor
Caitlin Adams, FrontierSI, Editor
Ivana Ivánová, Curtin University, Editor
Jim Antonisse, NGA, Contributor
Sara Saeedi, OGC, Contributor
Sina Taghavikish, OGC, Contributor

Testbed-18: Machine Learning Training Data ER

1.  Scope

The Open Geospatial Consortium (OGC) Testbed-18 initiative explored six tasks: Advanced Interoperability for Building Energy; Secure Asynchronous Catalogs; Identifiers for Reproducible Science; Moving Features and Sensor Integration; 3D+ Data Standards and Streaming; and Machine Learning (ML) Training Data (TD).

The goal of this Testbed-18 task is to develop the foundation for future standardization of Training Datasets (TDS) for Earth Observation (EO) applications. The task has included evaluating the status quo of TD formats, metadata models, and general questions of sharing and re-use. It has taken into account several initiatives, such as the European Space Agency’s Artificial Intelligence-Ready EO Training Datasets (AIREO), the Radiant MLHub, and the SpatioTemporal Asset Catalog (STAC) family of specifications.

For the purposes of this Engineering Report (ER), the authors define EO data as data that has been collected through remote sensing, including passive and active sensors carried on drones, airplanes, helicopters, or satellites.

In terms of ML applications, the most appropriate scope is supervised learning, as this type of ML directly leverages labeled training datasets. However, unsupervised learning will also be considered. These types of learning are also appropriate for the context of this work within the field of EO, as the application of ML in EO is often focused on the goal of identifying meaningful features from input EO data using a set of known mappings between inputs and desired outputs (the training dataset).

In laying out a path for future standardization of training datasets for EO applications, this ER has also taken into account, and collaborated with, the Training Data Markup Language (DML) for AI Standards Working Group (SWG). The SWG is chartered to develop the Unified Modelling Language (UML) model and encodings for geospatial ML training data. While these Testbed-18 activities have progressed, the SWG has released draft versions of its Conceptual Model Standard (Part 1) and JSON Encoding (Part 2).

2.  Normative references

The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

AIREO: Best Practice Guidelines

AIREO: Specification

W3C: Data on the Web Best Practices, W3C Best Practice, 2017

Gebru, T. , J. Morgenstern , B. Vecchione, J.W. Vaughan, H. Wallach, H. Daume III, and K. Crawford. Datasheets for datasets. Communications of the ACM. 2021, 64(12):86–92.

Lavender, S. Detection of Waste Plastics in the Environment: Application of Copernicus Earth Observation Data. Remote Sens. 2022, 14, 4772.

McKee, L., C. Reed, and S. Ramage. 2011. “OGC Standards and Cloud Computing.” OGC White Paper. Accessed 29 March.

OGC: API-Records

Oxford Reference: Artificial intelligence

Yue, P., Shangguan, B., Hu, L., Jiang, L., Zhang, C., Cao, Z., and Pan, Y. Towards a training data model for artificial intelligence in earth observation. International Journal of Geographical Information Science. 2022, 36(11):2113–2137.

3.  Terms, definitions and abbreviated terms

This document uses the terms defined in OGC Policy Directive 49, which is based on the ISO/IEC Directives, Part 2, Rules for the structure and drafting of International Standards. In particular, the word “shall” (not “must”) is the verb form used to indicate a requirement to be strictly followed to conform to this document. OGC documents do not use the equivalent phrases in the ISO/IEC Directives, Part 2.

This document also uses terms defined in the OGC Standard for Modular specifications (OGC 08-131r3), also known as the ‘ModSpec’. The definitions of terms such as standard, specification, requirement, and conformance test are provided in the ModSpec.

For the purposes of this document, the following additional terms and definitions apply.

3.1. Application Programming Interface

An Application Programming Interface (API) is a standard set of documented and supported functions and procedures that expose the capabilities or data of an operating system, application, or service to other applications (adapted from ISO/IEC TR 13066-2:2016).

3.2. OGC APIs

The family of OGC standards developed to make it easy for anyone to provide geospatial data to the web.

3.3.  Abbreviated terms


ADES: Application Deployment and Execution Service

AI: Artificial Intelligence

AP: Application Package

ARD: Analysis Ready Data

AWS: Amazon Web Services

CEOS: Committee on Earth Observation Satellites

DML: Data Markup Language

EMS: Exploitation Platform Management Service

EO: Earth Observation

ER: Engineering Report

ESA: European Space Agency

FAIR: Findability, Accessibility, Interoperability, and Reuse

ML: Machine Learning

OGC: Open Geospatial Consortium

STAC: SpatioTemporal Asset Catalog

SWG: Standards Working Group

TD: Training Data

TDS: Training Dataset

TrainingDML-AI: Training Data Markup Language for Artificial Intelligence

UML: Unified Modelling Language

4.  Engineering Report Overview

Artificial Intelligence (AI) and Machine Learning (ML) algorithms have great potential to advance processing and analysis of Earth Observation (EO) data. Among the top priorities for efficient machine learning algorithms is the availability of high-quality Training Datasets (TDSs). Training data (TD) is the initial dataset used to train ML algorithms. Models create and refine their rules using this data. Training data are also known as a training dataset (TDS), learning set, or training set.

TDSs are crucial for ML and AI applications, but they can also become a significant bottleneck to the more widespread and systematic application of AI/ML in EO.

The Engineering Report (ER) starts by providing an introduction to AI/ML in the context of EO (Clause 5) followed by a review of existing approaches to creating and storing TDSs (Clause 6). The relevant standards applicable in terms of metadata (Clause 7), creating a catalog (Clause 8), and expressing quality (Clause 9) are reviewed in subsequent sections. Unlocking the power of geospatial resources requires that those resources are stored following the Findability, Accessibility, Interoperability, and Reuse (FAIR) principles. The FAIR principles are reviewed concerning TDS (Clause 10). Finally, the summary (Clause 11) reviews the next steps, best practice ideas, and geoethics of generating and distributing TD.

5.  Introduction to AI/ML within the context of Earth Observation

This section outlines the overall scope for this ER, beginning with a summary of Artificial Intelligence (AI) and Machine Learning (ML). Next is a discussion of how AI/ML are used in the field of Earth Observation (EO) followed by a series of case studies that demonstrate the application of ML techniques to EO data in a variety of contexts. Finally, this section concludes with a discussion of foreseen issues and opportunities in relation to the creation of an OGC standard for Training Datasets (TDSs).

5.1.  Defining AI and ML

As a field, AI covers “the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages” (Oxford Reference). ML is a subset of the AI field, specifically focusing on the creation of algorithms that learn from data without explicit programming. The output of these algorithms is a trained ML model, which can process new inputs.

Within the domain of ML, there are three categories of application, each distinguished by their learning method.

  • Supervised learning: The algorithm is provided with a labeled TDS, which pairs input data with a training label. The algorithm creates an output from the input data, compares this with the training label, and then iteratively updates itself to maximize its accuracy relative to the desired output.

  • Unsupervised learning: The algorithm is provided with input data and attempts to identify commonalities and differences in the data that allow it to be grouped. Once groups are established, the algorithm is able to identify which group new input data should belong to.

  • Reinforcement learning: The algorithm is exposed to an environment to which it must respond and is then rewarded if it responded appropriately.

This Engineering Report (ER) primarily focuses on supervised learning applications. Supervised learning algorithms include linear regression, decision trees, support vector machines, and neural networks. In particular, convolutional neural networks are commonly used in EO applications due to their ability to identify features in images. Convolutional neural networks can be used for semantic segmentation (pixel-level classification) and object detection (bounding-box identification).
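As a minimal illustration of the supervised workflow described above, the following sketch fits a classifier to labeled samples and applies it to a new input. It uses scikit-learn as an assumed tool; the feature values and class meanings are entirely hypothetical and not drawn from any TDS in this report.

```python
# Minimal supervised-learning sketch: a classifier is fitted to input
# features paired with training labels, then applied to new inputs.
from sklearn.tree import DecisionTreeClassifier

# Input data: two spectral band values per sample (hypothetical).
X_train = [[0.1, 0.8], [0.2, 0.7], [0.8, 0.2], [0.9, 0.1]]
# Training labels: 0 = water, 1 = vegetation (hypothetical classes).
y_train = [0, 0, 1, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)           # learn the mapping from the TDS
print(model.predict([[0.85, 0.15]]))  # classify a new observation
```

The same pattern (fit on input/label pairs, predict on new inputs) underlies the neural network architectures mentioned above, at much larger scale.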

5.2.  Typical formats for TDSs in EO Applications

A TDS is made up of two components: input data and training labels. For EO applications, the input data are remote sensing observations: multispectral and hyperspectral satellite imagery; red, green, and blue (RGB) aerial photography; drone-mounted LiDAR point clouds; and so on. The training labels capture the classification and location of known features in the input data, such as the location and bounds of a building, or the type of crop at a given location. Training labels can be provided in vector or raster format, depending on the ML application. Supervised learning applications require both the input data and the training labels, whereas unsupervised learning applications only require the input data.
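The two components described above can be sketched as a single data structure. The following is a hypothetical, GeoJSON-flavored example of one TDS element pairing an input data reference with a vector training label; none of the field names come from an existing standard.

```python
import json

# One hypothetical TDS element: a reference to the input EO data plus a
# vector training label. Field names are illustrative, not a standard.
tds_element = {
    "input_data": {
        "source": "sentinel-2-l2a",        # assumed dataset identifier
        "bands": ["red", "green", "blue"],
        "datetime": "2022-01-15T00:00:00Z",
    },
    "training_label": {                     # GeoJSON Feature (vector label)
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [144.96, -37.81]},
        "properties": {"class": "building"},
    },
}
print(json.dumps(tds_element["training_label"]["properties"]))
```

An unsupervised application would consume only the `input_data` component; a supervised one consumes both.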

5.3.  Example use cases

This ER section describes some ML-for-EO use cases to provide context. The use cases are based on Testbed participants’ experience, particularly where ML is used to create spatial data (vector and raster) from EO data for a broad range of applications.

The use cases come from efforts by state and national governments to create spatial data over their jurisdictional areas using ML, as a method of automating the creation of these products. In these use cases, reusability is a key consideration: these agencies may wish to later extend the existing training set to keep the dataset current, as well as to explore new questions and produce new information products.

As part of each use case, the authors provide a summary of implications for a TDS standard. For a specific review of the suitability of the proposed TrainingDML-AI standard for these use cases, see Table A.1.

5.3.1.  Use case: Mapping vegetation from aerial imagery in Australia

Project Participants: The Department of Environment, Land, Water and Planning (Victoria, Australia), FrontierSI (Australia), Orbica (New Zealand).

Project Goal: Mapping tree cover consistently over time can help governments understand land use, urban heat, and fire risks. The Victorian Department of Environment, Land, Water and Planning sought to develop an automated machine learning approach that could map tree cover from collected high resolution aerial ortho-photography.

Challenge: Due to the labor-intensive process that was traditionally used to create and update the data, Vicmap Vegetation statewide data had not been updated for 20 years. The State of Victoria wanted to trial the use of ML for statewide data maintenance, using vegetation as a test case. Project participants wanted to create a repeatable method to update data as new imagery was acquired.

Input data: The input data was 10-20cm red, green, and blue aerial ortho-photography. All imagery was resampled to 20cm before being labeled and provided to the ML algorithm.

Training labels: Training labels were created by hand digitizing vector polygons delineating areas of a range of tree coverage and density. Initial labels were then used to train a semantic segmentation algorithm to produce further Training Data (TD), which was then refined into polygons by human examination.

Training data selection: TD were stratified using a vector dataset of ecological bioregions to capture training data covering the range of tree species and ecosystems in Victoria.

Method: The project used semantic segmentation (U-Net) with transfer learning as the ML model architecture. The project successfully created a statewide dataset and delivered the resulting vector data, training data, and scripts to re-run the ML process in the future.

Key metadata

  • Geographical extent and coverage

  • Extent and data summarizing ecological bioregions

  • Resolution, spectral wavelength range, date, seasonality, and quality of input data

  • Method of generation (human only or ML with human revision)

  • ID for association with ML model metadata to understand which data were used to train the model

Implications for a TDS standard

Input data may be modified from the original source for the purpose of creating a TDS. In this case study, all input data were resampled to a uniform resolution and labels were delineated from the resampled data. As such, the labels are appropriate for the resampled imagery and should be used with caution on imagery at a higher resolution. For input data supplied alongside training labels, a TDS standard should capture any modifications that have been made relating to the creation of the labels.

ML projects may use multiple methods to create TDSs. In this case study, humans provided an initial set of labels and used these to train an ML process to produce additional labels. For quality control, a TDS standard should capture how a given training data label was produced.

5.3.2.  Use case: Capturing footprints of building roofs (roofprints) from aerial imagery in Australia

Project Participants: The Department of Environment, Land, Water and Planning (Victoria, Australia), DSM Geodata.

Project Goal: Detailed and accurate building outlines can assist decision making for planning, infrastructure, and risk modelling. The project goal was to derive high-accuracy building roofprint models from existing aerial ortho-photography.

Challenge: The derived roofprints needed to be highly accurate to meet the needs of various sectors. At the time, no commercially available products met these needs. The project approach involved training an ML model and then manually revising the predicted rooflines to match the underlying imagery.

Input data: The input data was 10cm red, green, and blue aerial ortho-photography.

Training labels: Training labels were created by hand digitizing vector polygons delineating rooflines.

Training data selection: Validation data were collected from either the residential or non-residential zones of a core urban area.

Method: Computer vision (exact model architecture unknown) with outputs reviewed and cleaned by humans to achieve high accuracy.

Key metadata

  • Geographical extent and coverage

  • Definition of residential and non-residential zones

  • Designation of whether the data belongs to the training or validation set

  • Resolution, spectral wavelength range, date, seasonality, and quality of input data

  • Method of generation (human only or ML with human revision)

  • ID for association with ML model metadata to understand which data were used to train the model

Implications for a TDS standard

The TDS in this use case contained a specific validation set with labels from both residential and non-residential zones of a specific area. Validation sets may be specifically designed to represent the expected variability and presence of features in the domain. A TDS standard should provide an optional way for a creator to distinguish between elements of the TDS that belong to the training, validation, and test sets. A future user could then review validation samples to understand the domain the training data were designed for or use the same validation set with a new ML process and fairly compare the performance of the new method with an existing one.

5.3.3.  Use case: Capturing flood extent from aerial imagery in Australia

Project Participants: The Department of Customer Service, Spatial Services Division (New South Wales, Australia), Charles Sturt University (Australia), Deloitte (Australia), Intellify (Australia)

Project Goal: Emergency response efforts require timely access to flood boundaries to aid planning, rescue, recovery, and rebuilding. The project goal was to automatically delineate flood extent from post-flood aerial ortho-photography.

Challenge: Imagery alone is challenging for humans to interpret, particularly for emergency responders who have not been trained to interpret four-band imagery. ML for automated boundary detection provides an opportunity to deliver an easily interpretable data product soon after imagery capture, aiding emergency response.

Input Data: The input data was 15cm red, green, blue, and near infra-red aerial ortho-photography. For the final ML process, three-channel imagery was used containing the near infra-red, red, and green values.

Training labels: Areas identified as flood or non-flood for captured imagery.

Method: The project used an unsupervised Gaussian mixture model which identified clusters in the data. The identified clusters were then compared to labeled imagery to assign either flood or non-flood labels. When run on imagery, the Gaussian mixture model returned each pixel’s probability of being drawn from each of the identified clusters. Once clusters were labeled as flood or non-flood, pixels with a high probability of having been drawn from a flood cluster could be labeled as flood.
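The cluster-then-label approach described above can be sketched with scikit-learn's `GaussianMixture` as a stand-in for the project's actual model. The pixel values below are synthetic two-band samples; the real inputs were near infra-red, red, and green imagery.

```python
# Sketch of the unsupervised clustering step, assuming scikit-learn and
# synthetic pixel data; this is not the project's actual implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
water = rng.normal([0.1, 0.1], 0.02, size=(100, 2))  # dark, wet pixels
land = rng.normal([0.6, 0.5], 0.02, size=(100, 2))   # brighter pixels
pixels = np.vstack([water, land])

# Fit two clusters without labels, then get per-pixel cluster probabilities.
gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
probs = gmm.predict_proba(pixels)

# Labeled imagery is then used to decide which cluster represents flood;
# here a known water pixel identifies the flood cluster.
flood_cluster = gmm.predict(water[:1])[0]
is_flood = probs[:, flood_cluster] > 0.9   # high-probability flood pixels
print(int(is_flood.sum()))
```

The thresholding step mirrors the project's use of cluster-membership probability to flag flood pixels once clusters were labeled.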

Key metadata

  • Geographical extent and coverage

  • Definition of flooded areas

  • Resolution, spectral wavelength range, date, seasonality, and quality of input data

Implications for a TDS standard

While the ML process used in the case study was an unsupervised learning method, the project team still used training labels to identify clusters and then classify new input data as flood/not flood. A TDS standard should allow flexibility in how training labels are specified relative to the input imagery because the same TDS can be used as the input to many different ML approaches. Unnecessary rigidity in a TDS standard may prevent it from being applicable to newly developed TDS formats and ML applications.

5.3.4.  Use case: Classifying crops by type from satellite imagery in Zambia

Project Participants: FrontierSI (Australia), Tetra Tech (United States of America), Digital Earth Africa (South Africa).

Project Goal: Food security is a key issue in Africa. Knowledge of crop extents and types can help governments ensure access to food and plan for the future. The project goal was to use ML analysis of satellite imagery and other EO products to estimate the extent of major crop types, and thus the availability of produce, to assist with food security management in Zambia.

Challenge: The use of on-ground surveys to understand the distribution of crop extent for food security is a time-consuming and expensive process. Zambia needed a repeatable, scaled, country-wide process to provide timely estimates of crop type to inform food availability.

Input data: The input data was analysis-ready Sentinel-2 (multispectral) and Sentinel-1 (radar) satellite imagery at 10-60 m resolution, as well as ancillary datasets such as rainfall, digital elevation models, and analytic products derived from Landsat (multispectral) satellite imagery.

Training labels: Training labels were created through on-ground field collection with a GPS-enabled mobile device, associating a human-identified crop type with a point location. If the collector could not enter the field, the point was captured on the road and later moved into the area of the relevant crop. The vector points were labeled with the crop type, with each point associated with a specific field.

Training data selection: Unsupervised learning was applied to satellite data over known cropping areas to identify clusters of spectral variability. These clusters were then sampled to suggest locations for collecting training data, randomly stratified by the area covered by each class from the unsupervised learning process.

Method: The project used a supervised random forest as the ML model architecture. The project successfully created a country-wide dataset and delivered the resulting raster data, training data, and scripts to re-run the process in the future.
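A minimal sketch of the supervised random forest step, under the assumption that each field-collected point has been converted to a feature vector of band statistics. The values, class names, and feature count below are invented for illustration.

```python
# Random forest sketch for crop-type classification, assuming scikit-learn
# and synthetic per-point feature vectors (not the project's real data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# 30 hypothetical samples per class, 4 features each (e.g., band statistics
# extracted from Sentinel-1/-2 pixels at each labeled point).
maize = rng.normal(0.3, 0.05, size=(30, 4))
soy = rng.normal(0.7, 0.05, size=(30, 4))
X = np.vstack([maize, soy])
y = ["maize"] * 30 + ["soybean"] * 30

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[0.7, 0.7, 0.7, 0.7]]))  # classify a new point
```

In the real workflow, the trained model would then be applied to every pixel in the country-wide mosaic to produce the raster crop-type product.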

Key metadata

  • Geographical extent and coverage

  • Sampling strategy

  • Date and time of field collected label

  • Crop status at time of label (e.g., sown, ready for harvest, harvested, fallow)

  • List of classes and number of observations of each

  • Location of point relative to target field (e.g., center, roadside)

  • GPS accuracy of the point location

  • Whether the point has been updated by a human reviewer (e.g., moved from road to field while completing quality assurance)

Implications for a TDS standard

In this use case, the training data labels were collected through field sampling with a GPS-enabled device, meaning that they are not specifically tied to an input data source. If the date of capture for the training data labels is supplied, then the data should still be considered a valid TDS, because such information is sufficient for a user to select appropriately matched input data. A TDS standard should recognize that TDSs do not necessarily require associated input data, even though most TDSs will have it.

5.3.5.  Use case: Detection of plastics and waste across the world in terrestrial and marine environments

Project Participants: Pixalytics (UK), CLS (France & Indonesia), RisikoTek Pte Ltd (Singapore), rasdaman GmbH (Germany).

Project Goal: This use case focuses on the use of an ML TDS for the detection of waste plastics. It was developed in conjunction with two projects.

  • Marlisat: European Space Agency (ESA) funded study with the overarching objective of developing a unique combination of three innovative components to constitute a plastic anthropogenic marine debris monitoring system. The components were EO for detecting the source and impact of plastics, a low-cost satellite tracker deployed at sea, and a modeling tool to understand the at-sea plastic debris transport.

  • Space Detective: Singapore-funded project with the goal of detecting waste plastics, including tires on land, so they can be recycled.

Challenge: Detecting waste, whether plastics, tires, or mixed waste, at waste sites across the globe and in multiple land cover environments.

Input data: The primary input data was Sentinel-2 and Sentinel-1 satellite data, supplemented by a digital elevation model and vector layers for roads and coastlines for background mapping, with high-resolution commercial data to support focused activities.

Training labels: Training pixels were manually identified using a combination of the high spatial resolution satellite imagery within Google Earth and the Sentinel-2 RGB color composites for the different land cover types. Where the locations of the plastics could not be reliably identified, these land cover classes were not digitized; only the background land cover classes were digitized, so as not to reduce the accuracy of the overall dataset.

Training data selection: The test sites were accumulated over several years by reviewing peer-reviewed papers, reports, and news articles on plastic waste and its detection using remote sensing. This work continues as new sites are discovered and new versions of the model are generated. The test sites are separated into training/validation/testing datasets so that the data used to validate the model is not the same as the training data. Test data were also chosen carefully, as ML models often exhibit unexpectedly poor behavior when deployed in real-world domains; this has been identified as being caused by underspecification, where observed effects can have many possible causes. In addition, because the plastics classes have low numbers of pixels compared to the broader land cover classes, such as clouds, there was class imbalance during training. Therefore, in training the model, a re-weighting was applied that reduced the number of pixels for classes with high counts and increased the number of pixels for classes with low counts through duplication.
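The duplication-based re-weighting described above can be sketched as follows. The class names, pixel counts, and target count are illustrative only, and the helper function is hypothetical rather than part of the project's code.

```python
# Sketch of class re-balancing: rare classes are duplicated (sampled with
# replacement) and abundant classes subsampled, so that each class
# contributes a comparable number of pixels to training.
import numpy as np

rng = np.random.default_rng(0)

def rebalance(pixels_by_class, target):
    """Return `target` pixels per class, duplicating rare classes and
    subsampling common ones (hypothetical helper for illustration)."""
    balanced = {}
    for name, pixels in pixels_by_class.items():
        # Sample with replacement only when the class is smaller than target.
        idx = rng.choice(len(pixels), size=target,
                         replace=len(pixels) < target)
        balanced[name] = pixels[idx]
    return balanced

classes = {
    "cloud": rng.random((5000, 3)),    # abundant background class
    "plastic": rng.random((40, 3)),    # rare target class
}
balanced = rebalance(classes, target=1000)
print({name: len(v) for name, v in balanced.items()})
```

After re-balancing, each class contributes equally to the loss during training, mitigating the imbalance between plastics and background classes.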

Method: The project used a sequential artificial neural network (ANN) and post-ANN decision tree. The ANN on its own experienced confusion due to the broad range of environments it was applied within. A post-ANN decision tree allowed the user to decide whether the results were conservative or relaxed. For example, a conservative approach was adopted when time-series datasets were automatically processed to prevent a build-up of false positives which became distracting to users when observing composite outputs. See Lavender 2022 for further details.

Key metadata:

  • Geographical extent and coverage

  • Resolution, spectral wavelength range, date, and quality of input data

  • List of classes and number of observations of each

  • Links to literature sources identifying test sites

  • Designation of whether the data belongs to the training or test set

Implications for a TDS standard

In this use case, the TDS was compiled over multiple years due to the discovery of new test sites. Metadata identifying when a given TDS element was added, the test site it relates to, and whether it should be used within the training, validation, or test set is valuable to a new user who may only want to use data related to a particular test site. A TDS standard must support updating a TDS over time, including the addition of new entries; this also includes being able to “version” the TDS so that analyses can be compared over time as new entries are added.

While the entries in this use case came from the same input data, TDSs could be compiled from multiple input sources if they are maintained for a long period. This implies that individual elements of a TDS need to clearly capture the metadata of their associated input data.
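The versioning and provenance fields discussed above can be illustrated with a hypothetical metadata record for a single TDS element. All field names are invented for this sketch, not drawn from any existing standard.

```python
import json

# Hypothetical per-element metadata: when it was added, the test site and
# input data it relates to, its split, and the TDS version that first
# included it. Field names are illustrative only.
element = {
    "id": "site-042-sample-0007",
    "added": "2021-06-01",
    "tds_version": "1.3.0",
    "split": "validation",            # training | validation | test
    "test_site": "site-042",
    "input_data": {
        "source": "sentinel-2-l2a",
        "datetime": "2021-05-28T10:30:00Z",
        "resolution_m": 10,
    },
}

def in_version(el, version):
    # Naive string comparison, adequate for this illustration only.
    return el["tds_version"] <= version

# A user can filter a TDS by version, split, or site before training.
print(in_version(element, "1.4.0"), element["split"])
print(json.dumps(element["input_data"]))
```

With such records, a user can reconstruct exactly which elements existed at a given TDS version and compare analyses over time as entries are added.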

5.3.6.  Use case: Mapping coastal bathymetry using ML by combining multiple data sources

Project Participants: Satellite-Derived Bathymetry (SDB) activities to increase the global coverage of accurate bathymetry maps.

Project Goal: This use case focuses on the use of an ML TDS to support the extraction of information from multiple types of EO data, in support of extending the limited bathymetric data collection possible from vessels and airplanes.

Challenge: The techniques used for the detection of bathymetry vary according to both water depth and turbidity, so the TDS could contain data from multiple source types with different operating characteristics and uncertainties.

Input data: The input data can be any of the following.

  • LiDAR optical data, such as airborne LiDAR measurements and satellite ICESat-2 data.

  • Multispectral optical data sources such as high resolution Landsat and Sentinel-2 data alongside very-high resolution satellite missions such as WorldView.

  • Sonar data from underwater instruments such as single and multi-beam echo sounders.

Training labels: The input data will be labeled with the bathymetric depth.

Training data selection: Depending on the location of interest, e.g., whether it is a small area such as a port or global coverage, the source of the training data will vary.

Method: Published approaches use ML methods such as random forest (e.g., TCarta, who used ICESat-2 data, and Sagawa et al. 2022, who used multi-temporal satellite EO data to create a generalized model) and deep learning (e.g., Zhong et al. 2022, who used a framework containing a 2D convolutional neural network).

Key metadata:

  • Geographical extent and coverage

  • Input source, including spatial resolution

  • So far, SDB data are not considered hydrographic data (i.e., data that can be used for navigational charting) because of their lower accuracy and the difficulty of estimating uncertainties compared to data from conventional sensors (such as echo sounders and LiDAR). Therefore, the uncertainties of the input TDS are vital for progress to be made.

Implications for a TDS standard

A TDS standard must consider how to describe data from multiple source types and their associated uncertainties. Also, the SDB TDS will need to store the location in terms of both horizontal (latitude, longitude) and vertical (depth below defined water surface or height above a reference surface) coordinates.
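A hypothetical SDB training sample illustrating the horizontal and vertical coordinates and per-source uncertainty discussed above. Field names, the datum, and all values are invented for this sketch.

```python
# Hypothetical bathymetry training sample: horizontal position plus a
# vertical coordinate referenced to a stated surface, with a
# source-dependent uncertainty. Illustrative only, not a standard schema.
sample = {
    "lon": 115.74, "lat": -32.05,      # horizontal position (WGS 84 assumed)
    "depth_m": 12.4,                   # vertical: depth below reference surface
    "vertical_datum": "MSL",           # assumed reference surface
    "source": "multibeam-echo-sounder",
    "depth_uncertainty_m": 0.3,        # uncertainty varies by source type
}

# The uncertainty lets a consumer bound the measurement per source.
shallow = round(sample["depth_m"] - sample["depth_uncertainty_m"], 1)
print(f"depth at least {shallow} m below {sample['vertical_datum']}")
```

Recording the datum and uncertainty alongside each sample is what allows data from echo sounders, LiDAR, and satellite sources to be combined in one TDS.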

5.4.  Opportunities

As highlighted in the above case studies, ML has become widely used in the automated creation of insightful data products from EO data. As TDSs form the basis of ML approaches, a TDS standard has the potential to improve the quality and consistency of the application of ML to EO. This section covers specific opportunities that a TDS standard could enable.

The generation of a TDS is context-specific. The process is directly linked to the geospatial and temporal domain over which it is created, as well as the features it includes. A TDS standard would encourage TDS creators to provide this context along with the TDS. This allows future users to understand the TDS’s applicability to a new domain, or to refresh or augment the TDS to capture different features of interest. As TDSs are often time- and resource-intensive to create, improved reusability of TDSs would be valuable for the EO community.

Our world is constantly changing, and features captured by a TDS may become outdated over time. As such, the ability to describe and trace changes to features, along with versioning of TDSs, is important for ensuring ML applications are using valid data. A TDS standard can aid the cataloging and versioning of TDSs.

ML processes need high-quality and consistent TDSs to perform well. Quality may relate to consistency in labeling across the TDS, or to measures of its similarity to associated ground truth data. Having provenance and automated quality metrics captured by the standard would help creators deliver reliable and consistent TDSs and provide users with confidence in the TDS.

As agencies begin to rely on ML to produce automated products from EO data, it is critical that they are well-informed when creating or procuring TDSs. A TDS standard would support these agencies to request metadata that enables use and reuse as described above, without needing deep ML expertise. Clear descriptions of TDS metadata would also allow ML projects to be worked on by multiple providers, helping set clear expectations between the TDS creator and the TDS user, and allowing for transfer of a TDS across multiple parties.

5.5.  Challenges

The case studies presented in this ER also highlight several challenges that must be considered in the development of a TDS standard. This section describes specific challenges that arise when working with TDSs and how these relate to the creation of a standard.

The use cases demonstrate that TDSs are created for highly specific domain problems. The challenge for a TDS standard will be to support creators in providing sufficient information about the domain. Without this, a new user cannot easily assess whether the TDS can be leveraged in their domain. Relevant domain information includes the following.

  • Total geographic extent

  • Spatial distribution of individual TDS elements

  • Date and time of labeling

  • Date and time of input data capture

  • Properties of the input data and labels, including (but not limited to):

    • the source of the input data (e.g., a specific satellite or LiDAR instrument);

    • any corrections applied to the source data (e.g., terrain correction, top of atmosphere correction);

    • the features of the input data (e.g., spectral bands, derived features);

    • any properties of those features (e.g., spectral range, definition of any derived features, spatial resolution); and

    • uncertainties associated with the input data or labels (e.g., positional uncertainty from GPS, depth uncertainty from SDB).

  • Designation to training, validation or test set, for individual TDS elements

  • Description of sampling strategy

  • Description of methods used to stratify the data

  • Description of class imbalances present in the TDS
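
Some of the domain information listed above, such as split designation and class imbalance, can be summarized automatically from per-element records. A minimal sketch, using invented element records:

```python
from collections import Counter

# Hypothetical per-element records: split designation plus class label.
elements = [
    {"id": "e1", "split": "train", "label": "mangrove"},
    {"id": "e2", "split": "train", "label": "mangrove"},
    {"id": "e3", "split": "train", "label": "water"},
    {"id": "e4", "split": "validation", "label": "mangrove"},
    {"id": "e5", "split": "test", "label": "water"},
]

# Summarize split designation and class imbalance for the TDS metadata.
split_counts = Counter(e["split"] for e in elements)
class_counts = Counter(e["label"] for e in elements)

print(split_counts)  # Counter({'train': 3, 'validation': 1, 'test': 1})
print(class_counts)  # Counter({'mangrove': 3, 'water': 2})
```

Recording these counts in the TDS metadata lets a new user spot class imbalance before training rather than after.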

There are many methods that can be used to create training labels or input data, and several of these may be used within a single TDS. This may affect the overall quality of a TDS (discussed further in Clause 9), and a new user may wish to include, exclude, or revise particular elements based on their creation method. A TDS standard will need to ensure each element in a TDS can be labeled with the following information.

  • Who created the label (with each individual assigned a unique ID), including (but not limited to)

    • a domain expert

    • a non-expert

    • a machine learning process

  • The process for creating the label, including (but not limited to)

    • labeled from imagery by a human

    • generated by a machine learning process

    • collected in the field by a human

  • Version history of the label

  • Any accuracy measures related to the label (e.g., GPS accuracy for field-collected labels)

  • The path to corresponding input data

  • The process for creating the corresponding input data, including (but not limited to)

    • direct from source

    • augmented from source (e.g., rotated, shifted, mirrored)

    • synthesized (e.g., generated by a Generative Adversarial Network (GAN) or from simulations)
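
Taken together, the per-element information above could be captured in a record like the following sketch; the field names are illustrative assumptions, not part of a published standard.

```python
# Hypothetical per-element label record capturing the information listed above;
# field names are illustrative, not from a published standard.
label_record = {
    "element_id": "tile_0042",
    "label_creator": {"id": "annotator-17", "role": "domain expert"},
    "label_process": "labeled from imagery by a human",
    "label_versions": [
        {"version": 1, "date": "2021-06-01"},
        {"version": 2, "date": "2022-02-15", "note": "boundary refined"},
    ],
    "label_accuracy": {"gps_accuracy_m": None},  # not field-collected
    "input_data_path": "inputs/tile_0042.tif",
    "input_process": "augmented from source (rotated 90 degrees)",
}

# The latest version is the one an ML pipeline should consume.
latest = max(label_record["label_versions"], key=lambda v: v["version"])
print(latest["version"])  # 2
```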

The development of a standard for TDSs should anticipate that TDSs will evolve over time, as new algorithms are developed and popularized. The challenge for a TDS standard will be in capturing the critical domain information described above while remaining flexible enough to accommodate future changes in the way TDSs are generated.

By having a TDS standard there is potential for TDSs to become more interoperable, because users would have information on the limits of the domain of application for a given TDS. As such, the standard needs to address the idea that a new TDS could comprise selected elements of existing TDSs, with the lineage appropriately recorded.
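
Composing a new TDS from elements of existing TDSs while recording lineage can be sketched as follows; the structure and identifiers are invented for illustration.

```python
# Sketch: derive a new TDS from selected elements of two existing TDSs,
# recording lineage so the origin of each element stays traceable.
# Identifiers and structure are illustrative assumptions.

def derive_tds(new_id, sources, selector):
    """Build a new TDS from elements of existing TDSs, recording lineage."""
    elements, lineage = [], []
    for tds in sources:
        picked = [e for e in tds["elements"] if selector(e)]
        elements.extend(picked)
        lineage.append({
            "source_tds": tds["id"],
            "element_ids": [e["id"] for e in picked],
        })
    return {"id": new_id, "elements": elements, "lineage": lineage}

tds_a = {"id": "tds-a", "elements": [{"id": "a1", "label": "water"}, {"id": "a2", "label": "urban"}]}
tds_b = {"id": "tds-b", "elements": [{"id": "b1", "label": "water"}]}

merged = derive_tds("tds-c", [tds_a, tds_b], lambda e: e["label"] == "water")
print([e["id"] for e in merged["elements"]])  # ['a1', 'b1']
```

The lineage list preserves exactly which elements came from which source TDS, so quality issues discovered later can be traced back.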

6.  Current state of the art

This section reviews previous and on-going activities relevant to the definition of an Artificial Intelligence (AI)/Machine Learning (ML) Training Dataset (TDS) Standard.

6.1.  Training Data Markup Language for Artificial Intelligence Draft Standard

[peng] recognized that existing TDSs, including open-source benchmarks, can lack discoverability and accessibility, and that there is often no unified method to describe the Training Data (TD).

The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard, released for internal review on August 2, 2022, is a conceptual model defined using the Unified Modeling Language (UML) as a series of modules.

Figure 1 — TrainingDML-AI module overview.

TrainingDML-AI is designed as a universal information model that defines elements and attributes which are useful for a broad range of AI/ML applications. Any TD element may be augmented by additional attributes and relations whose names, data types, and values can be provided by a running application without requiring extensions to the TrainingDML-AI conceptual schema and respective encodings.

TrainingDML-AI builds on the ISO 19100 family of standards.

Figure 2 — Use of ISO Standards in TrainingDML-AI.

Annex B of the specification showcases the JSON encoding of example TDSs.
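
As a rough illustration of what such an encoding looks like, the sketch below is modeled on the module names in the draft standard; the element names used here are assumptions, and Annex B of the specification remains the authoritative reference.

```python
import json

# Rough sketch of a TrainingDML-AI-style JSON encoding; element names are
# assumptions for illustration -- Annex B of the specification is authoritative.
training_dataset = {
    "type": "AI_EOTrainingDataset",
    "id": "example_dataset",
    "name": "Example land-cover TDS",
    "description": "Illustrative only",
    "tasks": [{"type": "AI_EOTask", "taskType": "Semantic Segmentation"}],
    "data": [
        {
            "type": "AI_EOTrainingData",
            "id": "sample_001",
            "dataURL": ["images/sample_001.tif"],
            "labels": [{"type": "AI_SceneLabel", "labelClass": "forest"}],
        }
    ],
}

print(training_dataset["data"][0]["id"])  # sample_001
```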

6.2.  SpatioTemporal Asset Catalog (STAC)

The goal of the SpatioTemporal Asset Catalog (STAC) family of specifications is to standardize the way geospatial asset metadata is structured and queried. Of relevance to this Testbed-18 activity are the following extensions, which are currently (as of 03 August 2022) classed as Work In Progress.

  • ML AOI: An Item and Collection extension to provide labeled training data for ML models.

  • ML Model: An Item and Collection extension to describe ML models that operate on Earth observation data.

The ML AOI (Area of Interest) extension relies on, but is distinct from, the existing label extension. STAC items using the label extension link label assets with the source imagery for which they are valid, often as a result of human labeling effort. By contrast, STAC items using the ‘ml-aoi’ extension link label assets with raster items for each specific ML model being trained.
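
A STAC Item using the ‘ml-aoi’ extension might look like the following sketch. Because the extension is a work in progress, the field names, asset roles, and schema URI shown here are assumptions that may change.

```python
import json

# Sketch of a STAC Item using the draft 'ml-aoi' extension; field names and the
# schema URI are assumptions based on the work-in-progress extension.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": ["https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json"],
    "id": "ml-aoi-example",
    "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]},
    "bbox": [0, 0, 1, 1],
    "properties": {
        "datetime": "2022-08-03T00:00:00Z",
        "ml-aoi:split": "train",  # designation of this AOI to the training split
    },
    "links": [],
    "assets": {
        "labels": {"href": "labels.geojson", "roles": ["ml-aoi:label"]},
        "imagery": {"href": "scene.tif", "roles": ["ml-aoi:feature"]},
    },
}

# The item remains valid JSON and round-trips cleanly.
assert json.loads(json.dumps(item))["properties"]["ml-aoi:split"] == "train"
```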

6.3.  ESA funded initiatives and projects such as AIREO

The European Space Agency (ESA) funded Artificial Intelligence (AI) Ready Earth Observation activity AIREO provides resources and tools to data creators and users to ensure their TDS are FAIR and to standardize aspects of TDS such as quality assurance and metadata completeness indicators.

The aim is for the AIREO TDS Specification to be applicable to all levels of predictive feature data and to other target variable types. The purpose of the first version of the specification was to generate feedback on the content and requirements from the community to ensure a more useful and relevant V1 specification. As such, the focus has been on a limited number of datasets with examples in the AIREO pilot datasets.

Figure 3 — Overview of the AIREO TDS specification.

Figure 4 shows the STAC implementation of the data model.

Figure 4 — STAC implementation of the AIREO data model.

There are also the AIREO Best Practice Guidelines that outline how to generate and document AIREO-compliant datasets following the AIREO specifications.

Separately, the AI4EO initiative supports the aims of ESA’s Φ-lab that are to accelerate the future of EO by means of transformational innovations. AI4EO hosts challenges that bring AI and EO together.

6.4.  ANZLIC considerations of TDSs as foundational data

In the geospatial domain, the United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM) has identified 14 Global Fundamental Geospatial Data Themes. These data themes are considered fundamental to support global initiatives, such as reporting on the Sustainable Development Goals. The themes are being adopted by nations across the world and are driving the need to create and maintain these datasets in an efficient and effective manner. In Australia, the Australian and New Zealand Land Information Council (ANZLIC) has adopted the themes and challenged jurisdictions to develop and maintain the corresponding data. The ANZLIC community is considering the role of ML in this activity (as described in Clause 5.3.1), as well as how TDSs themselves may be considered a critical building block, raising the question of whether TDSs might in future be recognized as “foundational,” authoritative data products.

6.5.  Public TDS repositories

6.5.1.  Kaggle

The TDSs stored on Kaggle are in CSV, JSON, SQLite, and BigQuery formats, as well as others.

When the word “satellite” is used to filter the datasets to find those that are most relevant, CSV, JSON, and “other” are the predominant formats. NASA provides data in CSV, JSON, and NetCDF. “Other” includes a variety of image file formats, such as GeoTIFF, PNG, and JPG, as well as Shapefiles. The commercial operator Satellite Vu has provided a wildfire dataset in the TensorFlow TFRecord binary format, with the TD stored as features.

Kaggle has a usability rating index: a single score, with a maximum of 10, that rates how easy a dataset is to use based on a number of factors, including the level of documentation, availability of related public content such as kernels for reference, file types, and coverage of key metadata.

6.5.2.  LuoJiaSET, Wuhan University

LuoJiaSET is an Open AI/ML Training Data Hub, with datasets collated from several sources including competitions, review articles and papers with code, Kaggle, blogs, and GitHub. LuoJiaSET is a draft TrainingDML-AI API implementation.

6.5.3.  Radiant Earth Foundation & Radiant MLHub

Radiant Earth Foundation is focused on applying ML for EO to meet the UN Sustainable Development Goals and is developing the ML Model Extension to STAC.

The Radiant MLHub hosts open ML TDSs and models generated by Radiant Earth Foundation, partners, and community. A Python client allows users to search and download TDSs. Users may also use other scripting languages and the REST Application Programming Interface (API).

There are several online TDSs focused on applications such as building detection, crop classification, flooding, land cover, and marine debris.

6.5.4.  SpaceNet

SpaceNet is a nonprofit organization founded in 2016 by IQT Labs’ CosmiQ Works and Maxar to accelerate open source geospatial ML. They run data challenges and release the TDSs, baseline algorithms, winning algorithms, and detailed evaluations under an open-source license.

As of 2020, the Radiant Earth Foundation announced the registration of a STAC-compliant version of SpaceNet’s high-quality geospatial labeled datasets for roads and buildings on Radiant MLHub. The broader SpaceNet Dataset is hosted as an Amazon Web Services (AWS) Public Dataset.

6.5.5.  Zenodo

Zenodo was originally developed by the European Organization for Nuclear Research (CERN) as part of an EC project to support Open Data. The goal was to be a catch-all repository for EC funded research. Through various sources of funding, CERN makes Zenodo publicly available. Advantages of using Zenodo are that DOIs are created and Zenodo automatically maintains a list of uses and citations.

Zenodo contains many ML TDSs, and users uploading data may choose the format of what is being uploaded. One example is the WorldStrat Dataset, which includes open high-resolution satellite imagery from Airbus-supplied SPOT 6/7 (1.5 m spatial resolution) paired with multi-temporal low-resolution satellite imagery from Sentinel-2 (10 m spatial resolution). The metadata are stored in a CSV file within the datasets, which are held in gzipped TAR files. As the WorldStrat creators wanted to lower the barrier to entry, the dataset and a PyTorch DataLoader are provided in a format most accessible to the ML community. The code is also open source and available on GitHub.
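
The WorldStrat pattern of CSV metadata held inside gzipped TAR files can be handled with the Python standard library alone. The sketch below writes and then reads back a tiny archive; the file name and columns are invented for illustration.

```python
import csv
import io
import tarfile

# Build a tiny tar.gz with a metadata CSV inside, then read it back --
# mimicking (with invented file and column names) the layout of CSV
# metadata stored within gzipped TAR files.
buf = io.BytesIO()
csv_bytes = b"tile_id,source,resolution_m\nT001,SPOT-6,1.5\nT002,Sentinel-2,10\n"
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="metadata.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.extractfile("metadata.csv")
    rows = list(csv.DictReader(io.TextIOWrapper(member, encoding="utf-8")))

print(rows[0]["source"])  # SPOT-6
```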

6.6.  Previous OGC activities

6.6.1.  Testbed-16

The OGC Testbed-16 Machine Learning (ML) task focused on understanding the potential of existing and emerging OGC standards for supporting ML applications in the context of wildland fire safety and response. Relevant recommendations for this broader activity included the following.

  • Make sure that datetimes are properly and accurately set in datasets.

  • Provide accuracy information in the metadata of each training dataset.

  • Establish a standard way to store and reuse a model.

As future work, the following was suggested.

  • There is a real need to work out a best practice for a generalizable metadata model (framework) for ML TDSs. The Key Elements for Metadata Content section contains several items that could form a basis of this best practice in the future.

  • Furthermore, Earth Observation (EO) datasets should be rendered AI-ready as described, for example, by the reference to this topic. Also in this context, efforts towards Analysis Ready Data (ARD) such as those proposed by the Committee on Earth Observation Satellites (CEOS) will likely become vital for future ML applications.

  • Solid and reliable ground truth datasets should be developed, including accuracy levels of the ML training data.

Summary of the Testbed-16 Metadata Content Section

The main points in the Testbed-16 ER metadata content section are as follows.

  • Metadata should at least contain statistical information about the dataset, such as its source, size, dimension, license, and update status, as well as, of course, its features.

  • Creating and generating metadata for ML or research data and datasets across the ML training data “lifecycle” preserves the data in the long run and will also facilitate the use of ML training data by non-experts.

  • The reader is also referred to the CDB SWG research on metadata standards and common mandatory elements across standards.

A set of rules and recommendations from OGC Testbed-16 are as follows. (Source: DMPTool. Digital Curation: A How-To-Do-It Manual; Digital Curation Centre).

  • Consider what information is needed for the data to be read and interpreted in the future.

  • Understand requirements for data documentation and metadata. Several instructive examples can be found under the Funder Requirements section of the Data Management Plan Tool (DMPTool).

  • Consult available metadata standards for the domain of interest. Refer to Common Metadata Standards and Domain Specific Metadata Standards for details.

  • Describe data and datasets created in the research lifecycle, and use software programs and tools to assist in data documentation. Assign or capture administrative, descriptive, technical, structural, and preservation metadata for the data. Some potential information to document includes the following.

    • Descriptive metadata

      • Name of creator of data set

      • Name of author of document

      • Title of document

      • File name

      • Location of file

      • Size of file

    • Structural metadata

      • File relationships (e.g., child, parent)

    • Technical metadata

      • Format (e.g., text, SPSS, Stata, Excel, tiff, mpeg, 3D, Java, FITS, CIF)

      • Compression or encoding algorithms

      • Encryption and decryption keys

      • Software (including release number) used to create or update the data

      • Hardware on which the data were created

      • Operating systems in which the data were created

      • Application software in which the data were created

    • Administrative metadata

      • Information about data creation (e.g., date)

      • Information about subsequent updates, transformation, versioning, summarization

      • Descriptions of migration and replication

      • Information about other events that have affected the files

    • Preservation metadata

      • File format (e.g., .txt, .pdf, .doc, .rtf, .xls, .xml, .spv, .jpg, .fits)

      • Significant properties

      • Technical environment

      • Fixity information

  • Adopt a thesaurus in the relevant field [i.e., common terminology] or compile a data dictionary for the dataset.

  • Obtain persistent identifiers (e.g., DOI) for datasets, if possible, to ensure data can be found in the future.
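
The fixity information mentioned under preservation metadata is typically a checksum recorded so that curators can later verify a file has not changed. A minimal sketch using SHA-256:

```python
import hashlib

# Fixity record for a file: algorithm name plus checksum value.
def fixity(data: bytes) -> dict:
    return {"algorithm": "SHA-256", "value": hashlib.sha256(data).hexdigest()}

record = fixity(b"example training tile bytes")

# Re-computing the checksum later and comparing confirms the file is unchanged.
assert record == fixity(b"example training tile bytes")
print(record["value"][:8])
```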

6.6.2.  Testbed-15

The Testbed-15 activity explored the ability of ML to interact with and use OGC Web Service Standards (OWS), including WPS, WFS, and CSW, in the context of natural resources applications.

The work exercised OGC standards using five different scenarios incorporating use cases that included traditional ML techniques for image recognition; understanding the linkages between different terms to identify a dataset; and vectorization of identified water bodies using satellite imagery. The Testbed-15 Engineering Report noted that the web service-based standards would soon be complemented by OGC API Standards based on OpenAPI descriptions and RESTful principles.

The Testbed recommendations were primarily linked to the OGC standards. However, the ER noted that even if the source code used to implement predictive models is kept static, the behavior of the models can change due to the varying availability and constant evolution of their training data. This affects the reliability of models and reproducibility of the experiments, which is a cornerstone of scientific research. Keeping track of changes in data is not an easy task, as Version Control Systems (VCS) are not typically made to track large binary files and solutions for small projects are often limited to locally hosted datasets that are not frequently updated.

Adding rigorous metadata fields related to data sources and modification times to standardized web service requests was seen as greatly improving the robustness of ML training and evaluation services.

6.6.3.  Testbed-14

The Testbed-14 ML activity also focused on how to support and integrate emerging AI and ML tools using OWS, as well as publishing their input and outputs. A proof-of-concept client application executed processes offered by the ML system and displayed its results found in an Image and Feature Repository.

7.  Metadata requirements and recommendations

Metadata are crucial for ensuring lossless data interchange and their appropriate use. Metadata can be created automatically during data capture (e.g., timestamps of a data record, or an automatic label of data production software), or added before advertising the data object to provide context for understanding the creation of a dataset (e.g., through detailed description of dataset’s provenance information).

7.1.  Current structure and usage of metadata in ML TDS

As outlined in Clause 6, most current ML TDS models use the STAC family of specifications as the basis to structure the TDS and related metadata. The STAC specification defines only a limited set of ‘STAC Core Metadata’ elements used for STAC Catalog ‘Collection’ and STAC Catalog ‘Item’. The following core metadata elements are required.

  • Basic metadata to provide an overview of a STAC Item

    • title

    • description

  • Date and Time definition

    • datetime, created, and updated — to allow recording of the data capture time and of subsequent metadata creation and updates

    • start_datetime and end_datetime — to allow specification of ranges of capture datetimes

  • License information for data and metadata

  • Provider information — to allow defining information about provider (e.g., name, description, and url) and their roles (e.g., processor, producer, licensor, host)

  • Instrument information — to allow specifying information about the platform, instrument, mission, constellation, and ground sampling distance used for data acquisition
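
The required core elements listed above can be gathered in a single STAC Item. The sketch below uses invented values; the property names follow the STAC Common Metadata conventions as understood at the time of writing.

```python
# Sketch of a STAC Item carrying the core metadata elements listed above;
# all values are invented for illustration.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": None,
    "properties": {
        "title": "Example TDS item",
        "description": "Illustrative only",
        "datetime": "2022-08-03T00:00:00Z",
        "created": "2022-08-01T00:00:00Z",
        "updated": "2022-08-02T00:00:00Z",
        "start_datetime": "2022-07-01T00:00:00Z",
        "end_datetime": "2022-07-31T23:59:59Z",
        "license": "CC-BY-4.0",
        "providers": [{"name": "Example Org", "roles": ["producer", "licensor"]}],
        "platform": "sentinel-2a",
        "instruments": ["msi"],
        "gsd": 10,
    },
    "links": [],
    "assets": {},
}

# Core date/time fields should be present before publishing the item.
assert {"datetime", "created", "updated"} <= item["properties"].keys()
```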

Given its modular nature, the STAC specification allows enhancing the metadata definition of STAC objects through extensions. One of the stable STAC extensions recommended for the definition of ML items and collections (see Clause 6) is the ‘Scientific Citation Extension’. This extends the STAC core metadata elements with reference information about the publication from which a STAC object originates and how it should be cited or referenced. Additional scientific citation metadata, such as the Digital Object Identifier (DOI), citation, and indication of relevant publications, help to increase the reproducibility and findability of a STAC object, thus improving its FAIRness (more detail in Clause 10).

7.2.  Review and application of ISO metadata standards for ML TDS

This section discusses four key ISO metadata standards: ISO 19115-1, ISO 19115-2, ISO 19157-1, and ISO 19157-3.

7.2.1.  ISO 19115-1 and ISO 19115-2 for geographic information

ISO 19115-1:2014 Geographic information — Metadata — Part 1: Fundamentals

ISO 19115-1:2014 defines the schema required for metadata about geographic datasets and services. The standard defines the structure for information about data and metadata identification, spatial and temporal extent, quality, distribution, and licenses. This standard is applicable to the definition of metadata catalogs (typically used in a Spatial Data Infrastructure — SDI) as well as for describing geographic resources of various kinds (i.e., datasets or services, maps, charts, or textual documents about geographic resources) and at various levels of detail (e.g., dataset, feature, or attribute). Figure 5 illustrates the metadata schema defined in ISO 19115-1.

Figure 5 — ISO 19115-1 Metadata schema.

ISO 19115-1:2014 also identifies the minimum metadata set required to serve most metadata applications, including data discovery, access, transfer, and use, and a decision on a dataset’s fitness for use (see Table 1).

Table 1 — Metadata for the discovery of geographic datasets

Metadata element | Obligation | Comment
Metadata reference information | Optional | Unique identifier for the metadata
Resource title | Mandatory | Title by which the resource is known
Resource reference date | Optional | A date which is used to help identify the resource
Resource identifier | Optional | Unique identifier for the resource
Resource point of contact | Optional | Name, affiliation, and role of the person responsible for the resource
Geographic location | Conditional | Geographic coordinates or description of metadata location — Mandatory if the described resource is not a ‘dataset’.
Resource language | Conditional | Language used to describe the resource — Mandatory if other than default (English).
Resource topic category | Conditional | A selection from the list of topics defined in ISO 19115-1 — Mandatory if the described resource is not a ‘dataset’ or a ‘dataset series’.
Spatial resolution | Optional | The nominal scale and/or spatial resolution of the resource
Resource type | Conditional | ISO 19115-1 standard code (e.g., dataset, feature, attribute, product) the metadata describe — Mandatory if the described resource is not a ‘dataset’.
Resource abstract | Mandatory | A brief description of the content of the resource
Extent information for the dataset | Optional | Temporal or vertical extent of the resource
Resource lineage/provenance | Optional | Source and production steps used in producing the resource.
Resource on-line link | Optional | URL for the resource.
Keywords | Optional | Words and phrases describing the resource to be indexed and searched.
Constraint on the resource access and use | Optional | Restrictions on the access and use of the resource.
Metadata date stamp | Mandatory | Reference date for the creation (and update) of metadata
Metadata point of contact | Mandatory | The party responsible for the metadata.

Note in Table 1 that if a described resource is a ‘dataset’ or a ‘dataset series’, ISO 19115-1 mandates only four metadata elements to describe such a resource.

  1. Resource title

  2. Resource abstract

  3. Metadata date stamp

  4. Metadata point of contact

Arguably, this is insufficient to ensure findability, accessibility, interoperability, and reuse of a resource, especially by machines, which is often the case in ML.
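
A minimal sketch of checking a metadata record against the four mandated elements; the dictionary keys here are shorthand for this illustration, not ISO element names.

```python
# Check that a dataset metadata record carries the four elements ISO 19115-1
# mandates for a 'dataset' resource. Keys are shorthand, not ISO element names.
MANDATORY = {"title", "abstract", "metadata_date", "metadata_contact"}

def missing_mandatory(record: dict) -> set:
    """Return the mandated elements absent from the record."""
    return MANDATORY - record.keys()

record = {
    "title": "Example TDS",
    "abstract": "Illustrative record",
    "metadata_date": "2022-08-03",
}
print(missing_mandatory(record))  # {'metadata_contact'}
```

Even a complete record under this check carries far less than the discovery metadata of Table 1, which illustrates why the minimal set alone is unlikely to support machine-driven ML workflows.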

ISO 19115-2:2019 Geographic information — Metadata — Part 2: Extensions for acquisition and processing

ISO 19115-2:2019 extends ISO 19115-1:2014 by defining the schema required for describing the acquisition and processing of geographic information, including imagery. This standard defines the structure for describing properties of measuring systems and the numerical methods and computational procedures used to derive geographic information from the acquired data. Figure 6 illustrates the metadata schema defined in ISO 19115-2.