I. Abstract
The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard aims to develop the UML model and encodings for geospatial machine learning training data. Training data plays a fundamental role in Earth Observation (EO) Artificial Intelligence Machine Learning (AI/ML), especially Deep Learning (DL). It is used to train, validate, and test AI/ML models. This Standard defines a UML model and encodings consistent with the OGC Standards baseline to exchange and retrieve the training data in the Web environment.
The TrainingDML-AI Standard provides detailed metadata for formalizing the information model of training data. This includes but is not limited to the following aspects:
How the training data is prepared, such as provenance or quality;
How to specify different metadata used for different ML tasks such as scene/object/pixel levels;
How to differentiate the high-level training data information model and extended information models specific to various ML applications; and
How to introduce external classification schemes and flexible means for representing ground truth labeling.
II. Keywords
The following are keywords to be used by search engines and document catalogues.
ogcdoc, OGC document, artificial intelligence, machine learning, deep learning, earth observation, remote sensing, training data, training sample, UML
III. Security considerations
No security considerations have been made for this Standard.
IV. Submitting Organizations
The following organizations submitted this Document to the Open Geospatial Consortium (OGC):
- Wuhan University
- Pixalytics Ltd
- WiSC Enterprises
- George Mason University
- Laboratoire d'Informatique de Grenoble
- South Digital Technology Co., Ltd
- Wuhan University of Technology
- Hubei University
- Chongqing Changan Automobile Co., Ltd
V. Submitters
All questions regarding this submission should be directed to the editor or the submitters:
Name | Affiliation |
Peng Yue | Wuhan University |
Jianya Gong | Wuhan University |
Ruixiang Liu | Wuhan University |
Dayu Yu | Wuhan University |
Samantha Lavender | Pixalytics Ltd |
Jim Antonisse | WiSC Enterprises |
Liping Di | George Mason University |
Eugene Yu | George Mason University |
Danielle Ziébelin | Laboratoire d’Informatique de Grenoble |
Boyi Shangguan | South Digital Technology Co., Ltd |
Lei Hu | South Digital Technology Co., Ltd |
Liangcun Jiang | Wuhan University of Technology |
Mingda Zhang | Hubei University |
Kai Yan | Chongqing Changan Automobile Co., Ltd |
VI. Acknowledgements
Thanks to the members of the TrainingDML-AI Standards Working Group of the OGC as well as all contributors of change requests and comments. In particular: Scott Simmons, Carl Reed, Ivana Ivánová, Emily Daemen, Jon Duckworth, Zheheng Liang, Jibo Xie, Yuqi Bai, Winnie Shiu, Ignacio Correas, Chenxiao Zhang, Zhipeng Cao, Haofeng Tan, Yinyin Pan, Hanwen Xu, Shuaiqi Liu, Hao Li, Ming Wang, Kaixuan Wang, Haipeng Deng, Gobe Hobona, Chris Little, Kathi Schleidt, Rodolfo Orozco.
OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Part 1: Conceptual Model Standard
1. Scope
Training data is the building block of machine learning models. These models now constitute the majority of machine learning applications in Earth science. Training data is used to train AI/ML models, and to then validate model results. Formalizing and documenting the training data by characterizing the training data content, metadata, data quality, and provenance, and so forth is essential.
This OGC Training Data Standard describes work actions around training data:
Documents the UML model with a target of maximizing the interoperability and usability of EO imagery training data;
Defines different AI/ML tasks and labels in earth observation in terms of supervised learning, including scene level, object level and pixel level tasks;
Describes the description of the permanent identifier, version, license, training data size, measurement or imagery used for annotation, and so on; and
Defines the description of quality (e.g., training data errors, training data unrepresentativeness) and the provenance (labeling, labeler, and labeling procedure).
2. Conformance
This TrainingDML-AI Standard defines a conceptual model that is independent of any encoding or formatting technologies. The standardization targets for this Standard is:
TrainingDML-AI Conceptual Model
Conformance with this Standard shall be checked using all the relevant tests specified in Annex A (normative) of this document. The framework, concepts, and methodology for testing, and the criteria to be achieved to claim conformance are specified in the OGC Compliance Testing Policies and Procedures and the OGC Compliance Testing web site.
All requirements-classes and conformance-classes described in this document are owned by the standard identified.
3. Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO: ISO 19103:2015, Geographic information — Conceptual schema language. International Organization for Standardization, Geneva (2015). https://www.iso.org/standard/56734.html.
ISO: ISO 19107:2019, Geographic information — Spatial schema. International Organization for Standardization, Geneva (2019). https://www.iso.org/standard/66175.html.
ISO: ISO 19115-1:2014, Geographic information — Metadata — Part 1: Fundamentals. International Organization for Standardization, Geneva (2014). https://www.iso.org/standard/53798.html.
ISO: ISO 19157-1, Geographic information — Data quality — Part 1: General requirements. International Organization for Standardization, Geneva https://www.iso.org/standard/78900.html.
4. Terms and definitions
This document uses the terms defined in OGC Policy Directive 49, which is based on the ISO/IEC Directives, Part 2, Rules for the structure and drafting of International Standards. In particular, the word “shall” (not “must”) is the verb form used to indicate a requirement to be strictly followed to conform to this document and OGC documents do not use the equivalent phrases in the ISO/IEC Directives, Part 2.
This document also uses terms defined in the OGC Standard for Modular specifications (OGC 08-131r3), also known as the ‘ModSpec’. The definitions of terms such as standard, specification, requirement, and conformance test are provided in the ModSpec.
For the purposes of this document, the following additional terms and definitions apply.
4.1. Artificial Intelligence (AI)
refers to a set of methods and technologies that can empower machines or software to learn and perform tasks like humans.
4.2. Machine Learning (ML)
is an important branch of artificial intelligence that gives computers the ability to improve their performance without explicitly being programmed to do so. ML processes create models from training data by using a set of learning algorithms, and then can use these models to make predictions. Depending on whether the training data include labels, the learning algorithms can be divided into supervised and unsupervised learning.
4.3. Deep Learning (DL)
is a subset of machine learning, which is essentially a neural network with three or more layers. The number of layers is referred to as depth. While a neural network with a single layer can still make approximate predictions, additional hidden layers can help to optimize and refine for accuracy.
4.4. Training Dataset
a collection of samples, often labeled in terms of supervised learning. A training dataset can be divided into training, validation, and test sets. Training samples are different from samples in OGC Observations & Measurements (O&M). They are often collected in purposive ways that deviate from purely probability sampling, with known or expected results labeled as values of a dependent variable for generating a trained predictive model.
4.5. Label
refers to known or expected results annotated as values of a dependent variable in training samples. A training sample label is different from those on a geographical map, which are known as map labels or annotations.
4.6. Provenance
information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. In this standard provenance is a record of how training data were prepared.
4.7. Quality
degree to which a set of inherent characteristics fulfils requirements [ISO 9000:2005, definition3.1.1]. Quality of training data (such as data imbalance and mislabeling) can impact the performance of AI/ML models.
4.8. Earth Observation
data and information collected about our planet, whether atmospheric, oceanic or terrestrial. This includes space-based or remotely-sensed data, as well as ground-based or in situ data.
4.9. Scene Classification
task of identifying scene categories of images, on the basis of a training set of images whose scene categories are known.
4.10. Object Detection
task of recognizing objects such as cars from images. The objects are often localized using bounding boxes.
4.11. Semantic Segmentation
task of assigning class labels to pixels of images or points of point clouds.
4.12. Change Detection
task that finds the changes in an area between images taken at different times.
4.13. 3D Model Reconstruction
task that builds 3D objects and scenes from multi-view images.
4.14. Generative Model
is one of the methods of large model training, which improve model performance through unsupervised pre-training. In the fine-tuning phase, labeled data plays a critical role in optimizing the model for specific vertical domains or tasks. By incorporating labeled data, the model can learn to accurately identify and extract relevant features, leading to better performance on specific downstream tasks. Overall, the combination of generative models and fine-tuning with labeled data can significantly improve the performance of large models in specialized domains or tasks.
5. Conventions
This section provides details and examples for any conventions used in the document. Examples of conventions are symbols, abbreviations, use of XML schema, or special notes regarding how to read the document.
5.1. Identifiers
The normative provisions in this specification are denoted by the URI
http://www.opengis.net/spec/TrainingDML-AI-1/1.0
All requirements and conformance tests that appear in this document are denoted by partial URIs which are relative to this base.
5.2. Abbreviated terms
In this document, the following abbreviations and acronyms are used or introduced:
AI
Artificial Intelligence
DL
Deep Learning
EO
Earth Observation
ISO
International Organization for Standardization
JSON
JavaScript Object Notation
LC
Land Cover
LU
Land Use
ML
Machine Learning
OGC
Open Geospatial Consortium
RS
Remote Sensing
TD
Training Data
UML
Unified Modeling Language
XML
Extensible Markup Language
5.3. UML Notation
The Standard is presented in this document through diagrams using the Unified Modeling Language (UML) static structure diagram. The UML notations used in this Standard are described in the diagram in Figure 1.
Figure 1 — UML notation (see ISO TS 19103, Geographic information — Conceptual schema language).
All associations between model elements in the TrainingDML-AI Conceptual Model are uni-directional. Thus, associations in the model are navigable in only one direction. The direction of navigation is depicted by an arrowhead. In general, the context an element takes within the association is indicated by its role. The role is displayed near the target of the association. If the graphical representation is ambiguous though, the position of the role has to be drawn to the element the association points to.
The following stereotypes are used in this model.
«DataType» defines a set of properties that lack identity. A data type is a classifier with no operations, whose primary purpose is to hold information.
«CodeList» enumerates the valid attribute values. In contrast to Enumeration, the list of values is open and, thus, not given inline in the TrainingDML-AI UML Model. The allowed values can be provided within an external code list.
6. Overview
The TrainingDML-AI Conceptual Model Standard defines how to represent and exchange ML training data. The conceptual model includes the most relevant training data entities from datasets, to instances (i.e. individual training samples), to labels. The conceptual schema specifies how and into which parts of the training data should be decomposed and classified.
The TrainingDML-AI conceptual model (Clause 7) is formally specified using UML class diagrams, complemented by a data dictionary (Clause 8) providing the definitions and explanations of the object classes and attributes. This conceptual model provides the basis for specifying encoding implemented in languages such as JSON, or XML.
6.1. AI Tasks for EO
In recent years AI/ML is increasingly used in the EO domain. The new AI/ML algorithms frequently require large training datasets as benchmarks. AI/ML TD have been used in many EO applications to calibrate the performance of AI/ML models. Many efforts have been made to produce training datasets to make accurate predictions. As a result, a number of training datasets are publicly available, with new datasets being constantly released. In the EO domain, examples of AI/ML training datasets have been developed in various tasks including the following typical scenarios:
Scene classification. These algorithms determine image categories from numerous pictures (e.g., agricultural, forest, and beach scenes). The training samples are a series of labeled pictures. The data can be either from satellite, drones, or aircrafts. The metadata of the datasets often includes the number of training samples, the number of classes, and the image size.
Object detection. These algorithms detect and localize different objects (e.g., airplanes, cars and building) in a single image. The image can be optical or non-optical, such as Synthetic Aperture Radar (SAR). Recent work also suggests an increasing focus on object detection from street view imagery. Objects can be labeled with two forms of bounding boxes, i.e., oriented and horizontal bounding boxes. The geometry of a bounding box can be expressed using top-left/bottom-right coordinates, coordinates of four corners, or center coordinates along with the length and width of the box.
Semantic segmentation. In terms of Land cover (LC) and land use (LU) classification, this process assigns a LC/LU class label to a pixel (or groups of pixels) of RS imagery. Considering semantic segmentation of 3D point clouds, it is to classify points of a 3D point cloud into categories. TDs are usually composed of RS images/point clouds, and the corresponding labeled value of each pixel/point recording its class.
Change detection. These algorithms identify the difference between images acquired over the same geographical area but taken at different times. The TD comprise a set of pre-change and post-change RS images, with the corresponding ground truth map labeled changed and unchanged pixels. The image can be optical or SAR images.
3D model reconstruction. These algorithms infer the 3D geometry and structure of objects and scenes, mainly realized from the dense matching of multi-view images. The TD are usually composed of two-view or multi-view images, with the corresponding disparity map or depth maps as ground truth respectively.
6.2. Modularization
The TrainingDML-AI conceptual model provides models for the most important elements within TD. These elements have been identified to be either required or important in many different AI/ML tasks. However, implementations are not required to support the complete TrainingDML-AI model in order to be conformant to the Standard. Implementations may employ a subset of constructs according to their specific information needs. For this purpose, modularization is applied to the TrainingDML-AI.
Figure 2 — TrainingDML-AI module overview.
As shown in Figure 2, the TrainingDML-AI conceptual model is thematically decomposed into a Basic module, a Provenance module, a Quality module and a Changeset module. The Basic module comprises the basic concepts and elements, including AI_TrainingDataset, AI_TrainingData, AI_Label, and AI_Task, of the TrainingDML-AI, and thus, must be implemented by any conformant system. The Provenance module provides a comprehensive definition of provenance by AI_Labeling, AI_Labeler, and AI_Labeling Procedure. The Quality module offers quality description of TD with AI_DataQuality elements. And the Changeset module defines AI_TDChangeset between versions of datasets.
6.3. General Modeling Principles
6.3.1. Element Modeling
The modeling of all elements in the TrainingDML-AI conceptual model has the following principles (Reference [7]).
Granularity. Two levels of granularity are differentiated in the conceptual model: The Training Dataset is used to refer to the collection level, and the Training Data is used to refer to the individual level.
Label semantics. The training dataset will not be limited to one classification scheme. External classification schemes should be allowed to be linked into the Training Dataset to accommodate different cases in practice.
Light-weight design. The lightweight designed conceptual model has a minimum set of metadata elements, provenance, or quality measures at the collection level instead of at the individual level. This is to facilitate the understanding of the dataset and improve the scalability for communicating large training datasets.
Alignment. The modeling of elements in TDs can leverage existing efforts for wide adoption, such as for ISO 19109 Geographic information — Rules for application schema, ISO 19115-1 Geographic information — Metadata — Part 1: Fundamentals, ISO 19157-1 Geographic information — Data quality — Part 1: General requirements, and the OGC Geography Markup Language (GML) Standard. The conceptual model can be aligned with these existing standards and leverage capabilities fulfilled in part by other standards.
Quality, bias, and ethics. Elements related to quality, or more specifically, bias that can be used to reduce the errors when using AI/ML. For example, any knowledge of the TD imbalance and mislabeling can be stored in TD quality. In addition, data ethics aims to safeguard the responsible use of TD, and it can be addressed by using the license property in the TD.
Changeset. This will be an optional module in TD modeling. Changeset addresses how to capture changes in TD datasets. The change model considers the trend in TD collections to use the crowdsourcing platforms and borrow the change representation from the platforms such as OpenStreetMap.
6.3.2. Class Hierarchy and Inheritance of Properties and Relations
In the TrainingDML-AI conceptual model, the specific elements such as EO training datasets, EO training data, scene label, object label, and pixel label are defined as subclasses of more general higher-level classes. Hence, elements build a hierarchy along specialization / generalization relationships where more specialized elements inherit the properties and relationships of all their super classes along the entire generalization path to the topmost element.
6.3.3. Definition of the Semantics for all Classes, Properties, and Relations
The meanings of all elements defined in the TrainingDML-AI conceptual model are normatively specified in the data dictionary in Clause 8.
6.3.4. Data Integrity, Authenticity, and Non-repudiation
Sometimes training datasets can be downloaded, disseminated, and changed by anyone. The data integrity, authenticity, and non-repudiation are important to ensure unexpected bias propagation and distorted results. Currently the standard focuses on the information modeling, while data dissemination can be enriched with strategies from the general information domain by publishing hashes (e.g., MD5) and public-keys (e.g., RSA) after signing and encrypting.
6.4. Extending TrainingDML-AI
The TrainingDML-AI conceptual model is designed as a universal information model that defines elements and attributes which are useful for a broad range of AI/ML applications. In practical AI/ML applications, the elements within specific TDs will most likely contain attributes which are not explicitly modeled in TrainingDML-AI. Moreover, there might be TD elements which are not covered by the TrainingDML-AI thematic classes.
The model provides an abstract class-based method to support the exchange of such data. Elements not represented by the predefined thematic classes of the model may be modeled and exchanged by extending abstract class.
7. TrainingDML-AI UML Model
The TrainingDML-AI UML model is the normative definition of the TrainingDML-AI Conceptual Model. The tables and figures in this section were software generated from the UML model. As such, this section provides a normative representation of the TrainingDML-AI Conceptual Model.
7.1. ISO Dependencies
TrainingDML-AI builds on the ISO 19100 family of standards. The applicable standards are identified in Figure 3. Data dictionaries are included for all the ISO-defined classes explicitly referenced in the TrainingDML-AI UML model. These data dictionaries are provided for the convenience of the user. The ISO standards are the normative source.
Figure 3 — Use of ISO Standards in TrainingDML-AI.
The ISO classes explicitly used in the TrainingDML-AI UML model are introduced in Table 1. Further details about these classes can be found in the Data Dictionary within Clause 8.
Table 1 — ISO Classes used in TrainingDML-AI
Class Name | Description |
---|---|
Feature | Abstraction of real world phenomena. |
MD_Band | Range of wavelengths in the electromagnetic spectrum. |
MD_Scope | The target resource and physical extent for which information is reported. |
EX_Extent | Extent of the resource. |
CI_Citation | Standardized resource reference. |
DataQuality | Quality information for the data specified by a data quality scope. |
QualityElement | Aspect of quantitative quality information. |
7.2. Overview of the UML Model
The UML model is presented with core concepts in Figure 4, followed by the concrete classes in Figure 5. The following describes the core concepts.
AI_TrainingDataset: This concept represents a collection of training samples, i.e. a training dataset.
AI_TrainingData: This concept is an individual training sample in a training dataset.
AI_Task: This concept is used to identify the task that the training dataset is used for.
AI_Label: This concept represents the label semantics for TD.
AI_Labeling: This concept provides the provenance of how TD are created.
AI_TDChangeset: This concept records types of TD changes between two versions of the training dataset.
AI_DataQuality: This concept is associated with a training dataset to document its quality.
Figure 4 — Core concepts.
The full overview of concrete classes and attributes are presented in Figure 5. Concepts related to the EO AI/ML applications are defined as classes extended from abstract classes. Each core concept with related classes will be described in the rest subsections.