The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard aims to develop the UML model and encodings for geospatial machine learning training data. Training data plays a fundamental role in Earth Observation (EO) Artificial Intelligence/Machine Learning (AI/ML), especially Deep Learning (DL). It is used to train, validate, and test AI/ML models. This Standard defines a UML model and encodings consistent with the OGC Standards baseline to exchange and retrieve training data in the Web environment.
The TrainingDML-AI Standard provides detailed metadata for formalizing the information model of training data. This includes but is not limited to the following aspects:
How the training data is prepared, such as provenance or quality;
How to specify different metadata used for different ML tasks such as scene/object/pixel levels;
How to differentiate the high-level training data information model and extended information models specific to various ML applications; and
How to introduce external classification schemes and flexible means for representing ground truth labeling.
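To make these aspects concrete, the sketch below shows one plausible way such dataset-level metadata could be organized. It is entirely hypothetical and non-normative: every property name here is an assumption chosen for illustration, not a TrainingDML-AI term.

```python
import json

# Hypothetical sketch of dataset-level training-data metadata; the
# property names below are illustrative, not normative TrainingDML-AI terms.
dataset = {
    "id": "example_landcover_td",
    "task": "Semantic Segmentation",          # scene/object/pixel-level ML task
    "classes": ["water", "forest", "urban"],  # external classification scheme
    "provenance": {
        "labeler": "Example Mapping Team",
        "labelingProcedure": "manual digitization over optical imagery",
    },
    "quality": {
        # e.g., class balance as one simple, illustrative quality measure
        "classBalance": {"water": 0.2, "forest": 0.5, "urban": 0.3},
    },
}
print(json.dumps(dataset, indent=2))
```

A real encoding would follow the normative property names defined later in this Standard; the point here is only that provenance, quality, task, and classification-scheme information can all be attached at the dataset level.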
The following are keywords to be used by search engines and document catalogues.
ogcdoc, OGC document, artificial intelligence, machine learning, deep learning, earth observation, remote sensing, training data, training sample, UML
III. Security considerations
No security considerations have been made for this Standard.
IV. Submitting Organizations
The following organizations submitted this Document to the Open Geospatial Consortium (OGC):
- Wuhan University
- Pixalytics Ltd
- WiSC Enterprises
- George Mason University
- Laboratoire d'Informatique de Grenoble
- South Digital Technology Co., Ltd
- Wuhan University of Technology
- Hubei University
- Chongqing Changan Automobile Co., Ltd
All questions regarding this submission should be directed to the editor or the submitters:
| Name | Organization |
| --- | --- |
| Peng Yue | Wuhan University |
| Jianya Gong | Wuhan University |
| Ruixiang Liu | Wuhan University |
| Dayu Yu | Wuhan University |
| Samantha Lavender | Pixalytics Ltd |
| Jim Antonisse | WiSC Enterprises |
| Liping Di | George Mason University |
| Eugene Yu | George Mason University |
| Danielle Ziébelin | Laboratoire d'Informatique de Grenoble |
| Boyi Shangguan | South Digital Technology Co., Ltd |
| Lei Hu | South Digital Technology Co., Ltd |
| Liangcun Jiang | Wuhan University of Technology |
| Mingda Zhang | Hubei University |
| Kai Yan | Chongqing Changan Automobile Co., Ltd |
Thanks to the members of the TrainingDML-AI Standards Working Group of the OGC as well as all contributors of change requests and comments. In particular: Scott Simmons, Carl Reed, Ivana Ivánová, Emily Daemen, Jon Duckworth, Zheheng Liang, Jibo Xie, Yuqi Bai, Winnie Shiu, Ignacio Correas, Chenxiao Zhang, Zhipeng Cao, Haofeng Tan, Yinyin Pan, Hanwen Xu, Shuaiqi Liu, Hao Li, Ming Wang, Kaixuan Wang, Haipeng Deng, Gobe Hobona, Chris Little, Kathi Schleidt, Rodolfo Orozco.
OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Part 1: Conceptual Model Standard
Training data is the building block of machine learning models, which now constitute the majority of machine learning applications in Earth science. Training data is used to train AI/ML models and then to validate the model results. Formalizing and documenting training data by characterizing its content, metadata, data quality, provenance, and so forth is essential.
This OGC Training Data Standard does the following:
Documents the UML model with a target of maximizing the interoperability and usability of EO imagery training data;
Defines different AI/ML tasks and labels in Earth Observation in terms of supervised learning, including scene-level, object-level, and pixel-level tasks;
Describes metadata such as the permanent identifier, version, license, training data size, and the measurement or imagery used for annotation; and
Defines the description of quality (e.g., training data errors, training data unrepresentativeness) and of provenance (labeling, labeler, and labeling procedure).
This TrainingDML-AI Standard defines a conceptual model that is independent of any encoding or formatting technologies. The standardization target for this Standard is:
TrainingDML-AI Conceptual Model
Conformance with this Standard shall be checked using all the relevant tests specified in Annex A (normative) of this document. The framework, concepts, and methodology for testing, and the criteria to be achieved to claim conformance are specified in the OGC Compliance Testing Policies and Procedures and the OGC Compliance Testing web site.
All requirements classes and conformance classes described in this document are owned by the Standard identified.
3. Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO: ISO 19103:2015, Geographic information — Conceptual schema language. International Organization for Standardization, Geneva (2015). https://www.iso.org/standard/56734.html.
ISO: ISO 19107:2019, Geographic information — Spatial schema. International Organization for Standardization, Geneva (2019). https://www.iso.org/standard/66175.html.
ISO: ISO 19115-1:2014, Geographic information — Metadata — Part 1: Fundamentals. International Organization for Standardization, Geneva (2014). https://www.iso.org/standard/53798.html.
ISO: ISO 19157-1, Geographic information — Data quality — Part 1: General requirements. International Organization for Standardization, Geneva. https://www.iso.org/standard/78900.html.
4. Terms and definitions
This document uses the terms defined in OGC Policy Directive 49, which is based on the ISO/IEC Directives, Part 2, Rules for the structure and drafting of International Standards. In particular, the word “shall” (not “must”) is the verb form used to indicate a requirement to be strictly followed to conform to this document and OGC documents do not use the equivalent phrases in the ISO/IEC Directives, Part 2.
This document also uses terms defined in the OGC Standard for Modular specifications (OGC 08-131r3), also known as the ‘ModSpec’. The definitions of terms such as standard, specification, requirement, and conformance test are provided in the ModSpec.
For the purposes of this document, the following additional terms and definitions apply.
4.1. Artificial Intelligence (AI)
refers to a set of methods and technologies that can empower machines or software to learn and perform tasks like humans.
4.2. Machine Learning (ML)
is an important branch of artificial intelligence that gives computers the ability to improve their performance without explicitly being programmed to do so. ML processes create models from training data by using a set of learning algorithms, and then can use these models to make predictions. Depending on whether the training data include labels, the learning algorithms can be divided into supervised and unsupervised learning.
4.3. Deep Learning (DL)
is a subset of machine learning, which is essentially a neural network with three or more layers. The number of layers is referred to as depth. While a neural network with a single layer can still make approximate predictions, additional hidden layers can help to optimize and refine for accuracy.
4.4. Training Dataset
a collection of samples, often labeled in terms of supervised learning. A training dataset can be divided into training, validation, and test sets. Training samples are different from samples in OGC Observations & Measurements (O&M). They are often collected in purposive ways that deviate from purely probability sampling, with known or expected results labeled as values of a dependent variable for generating a trained predictive model.
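The division of a training dataset into training, validation, and test sets can be sketched as follows. This is a common convention, not a normative procedure; the function name, split ratios, and sample tuples are all assumptions for illustration.

```python
import random

def split_dataset(samples, train=0.7, val=0.15, seed=42):
    """Split a labeled sample collection into training, validation,
    and test subsets (a common, non-normative convention)."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical samples: (imagery reference, ground-truth label) pairs.
samples = [(f"tile_{i}.tif", "forest" if i % 2 else "water") for i in range(100)]
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```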
4.5. Label
refers to known or expected results annotated as values of a dependent variable in training samples. A training sample label is different from labels on a geographical map, which are known as map labels or annotations.
4.6. Provenance
information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness. In this Standard, provenance is a record of how training data were prepared.
4.7. Quality
degree to which a set of inherent characteristics fulfils requirements [ISO 9000:2005, definition 3.1.1]. The quality of training data (such as data imbalance and mislabeling) can impact the performance of AI/ML models.
4.8. Earth Observation
data and information collected about our planet, whether atmospheric, oceanic or terrestrial. This includes space-based or remotely-sensed data, as well as ground-based or in situ data.
4.9. Scene Classification
task of identifying scene categories of images, on the basis of a training set of images whose scene categories are known.
4.10. Object Detection
task of recognizing objects such as cars from images. The objects are often localized using bounding boxes.
4.11. Semantic Segmentation
task of assigning class labels to pixels of images or points of point clouds.
4.12. Change Detection
task that finds the changes in an area between images taken at different times.
4.13. 3D Model Reconstruction
task that builds 3D objects and scenes from multi-view images.
4.14. Generative Model
refers to a class of models used in large model training that improve model performance through unsupervised pre-training. In the fine-tuning phase, labeled data plays a critical role in optimizing the model for specific vertical domains or tasks. By incorporating labeled data, the model can learn to accurately identify and extract relevant features, leading to better performance on specific downstream tasks. Overall, the combination of generative models and fine-tuning with labeled data can significantly improve the performance of large models in specialized domains or tasks.
5. Conventions
This section provides details and examples for any conventions used in the document. Examples of conventions are symbols, abbreviations, use of XML schema, or special notes regarding how to read the document.
5.1. Identifiers
The normative provisions in this specification are denoted by the URI
All requirements and conformance tests that appear in this document are denoted by partial URIs which are relative to this base.
5.2. Abbreviated terms
In this document, the following abbreviations and acronyms are used or introduced:
ISO: International Organization for Standardization
OGC: Open Geospatial Consortium
UML: Unified Modeling Language
XML: Extensible Markup Language
5.3. UML Notation
The Standard is presented in this document through diagrams using the Unified Modeling Language (UML) static structure diagram. The UML notations used in this Standard are described in the diagram in Figure 1.
Figure 1 — UML notation (see ISO TS 19103, Geographic information — Conceptual schema language).
All associations between model elements in the TrainingDML-AI Conceptual Model are uni-directional. Thus, associations in the model are navigable in only one direction. The direction of navigation is depicted by an arrowhead. In general, the context an element takes within the association is indicated by its role, which is displayed near the target of the association. Where the graphical representation would otherwise be ambiguous, the role is placed next to the element the association points to.
The following stereotypes are used in this model.
«DataType» defines a set of properties that lack identity. A data type is a classifier with no operations, whose primary purpose is to hold information.
«CodeList» enumerates the valid attribute values. In contrast to Enumeration, the list of values is open and, thus, not given inline in the TrainingDML-AI UML Model. The allowed values can be provided within an external code list.
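The difference between a closed «Enumeration» and an open «CodeList» can be sketched in illustrative Python. The class names and the land-cover values below are assumptions for illustration, not part of the standard.

```python
from enum import Enum

# A UML «Enumeration»: the set of valid values is closed and given inline.
class MLTaskLevel(Enum):
    SCENE = "Scene"
    OBJECT = "Object"
    PIXEL = "Pixel"

# A UML «CodeList»: the set of valid values is open, so it can be extended
# by an external code list without changing the model itself.
class OpenCodeList:
    def __init__(self, values=()):
        self.values = set(values)

    def register(self, value):
        # Open list: new values can be supplied externally at any time.
        self.values.add(value)

# A hypothetical land-cover code list drawn from an external classification scheme.
land_cover = OpenCodeList({"water", "forest", "urban"})
land_cover.register("wetland")  # extension is allowed; no model change needed
print(sorted(land_cover.values))
```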
The TrainingDML-AI Conceptual Model Standard defines how to represent and exchange ML training data. The conceptual model includes the most relevant training data entities, from datasets, to instances (i.e., individual training samples), to labels. The conceptual schema specifies how, and into which parts, training data should be decomposed and classified.
The TrainingDML-AI conceptual model (Clause 7) is formally specified using UML class diagrams, complemented by a data dictionary (Clause 8) providing the definitions and explanations of the object classes and attributes. This conceptual model provides the basis for specifying encodings implemented in languages such as JSON or XML.
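As a non-normative illustration of how the dataset-to-instance-to-label decomposition might look in a JSON encoding, consider the sketch below. All property names are assumed for illustration and are not taken from the normative model.

```python
import json

# Hypothetical JSON encoding of one training sample (instance) with its label;
# property names are illustrative only, not normative TrainingDML-AI terms.
sample = {
    "type": "TrainingSample",
    "data": "images/tile_0001.tif",  # the EO imagery being labeled
    "labels": [
        {
            "class": "building",     # from an external classification scheme
            "geometry": {            # horizontal bounding box as one example
                "bbox": [10.0, 20.0, 50.0, 80.0],
            },
        }
    ],
}
encoded = json.dumps(sample)     # exchange form
decoded = json.loads(encoded)    # retrieval form; round trip is lossless
print(decoded["labels"][0]["class"])  # building
```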
6.1. AI Tasks for EO
In recent years, AI/ML has been increasingly used in the EO domain. New AI/ML algorithms frequently require large training datasets as benchmarks. AI/ML training data (TD) have been used in many EO applications to calibrate the performance of AI/ML models. Many efforts have been made to produce training datasets that help models make accurate predictions. As a result, a number of training datasets are publicly available, with new datasets being constantly released. In the EO domain, examples of AI/ML training datasets have been developed for various tasks, including the following typical scenarios:
Scene classification. These algorithms determine image categories from numerous pictures (e.g., agricultural, forest, and beach scenes). The training samples are a series of labeled pictures. The data can come from satellites, drones, or aircraft. The metadata of the datasets often includes the number of training samples, the number of classes, and the image size.
Object detection. These algorithms detect and localize different objects (e.g., airplanes, cars, and buildings) in a single image. The image can be optical or non-optical, such as Synthetic Aperture Radar (SAR). Recent work also suggests an increasing focus on object detection from street view imagery. Objects can be labeled with two forms of bounding boxes, i.e., oriented and horizontal bounding boxes. The geometry of a bounding box can be expressed using top-left/bottom-right coordinates, coordinates of four corners, or center coordinates along with the length and width of the box.
Semantic segmentation. In terms of land cover (LC) and land use (LU) classification, this process assigns a LC/LU class label to a pixel (or groups of pixels) of RS imagery. For 3D point clouds, semantic segmentation classifies the points of a cloud into categories. TD are usually composed of RS images/point clouds, with the corresponding labeled value of each pixel/point recording its class.
Change detection. These algorithms identify the difference between images acquired over the same geographical area but taken at different times. The TD comprise a set of pre-change and post-change RS images, with a corresponding ground truth map labeling changed and unchanged pixels. The images can be optical or SAR.
3D model reconstruction. These algorithms infer the 3D geometry and structure of objects and scenes, mainly realized from the dense matching of multi-view images. The TD are usually composed of two-view or multi-view images, with the corresponding disparity maps or depth maps as ground truth, respectively.
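The three bounding-box geometry encodings mentioned under object detection above can be converted between one another. A minimal sketch follows; the function names and tuple layouts are assumptions for illustration, not normative conventions.

```python
def corner_to_center(xmin, ymin, xmax, ymax):
    """Top-left/bottom-right corners -> center point plus width and height."""
    w, h = xmax - xmin, ymax - ymin
    return (xmin + w / 2.0, ymin + h / 2.0, w, h)

def center_to_corner(cx, cy, w, h):
    """Center plus width/height -> top-left/bottom-right corners."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def corner_to_four_points(xmin, ymin, xmax, ymax):
    """Axis-aligned corners -> explicit four-corner polygon (clockwise)."""
    return [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]

box = (10.0, 20.0, 50.0, 80.0)
cx, cy, w, h = corner_to_center(*box)
assert center_to_corner(cx, cy, w, h) == box  # round trip preserves the box
print(cx, cy, w, h)  # 30.0 50.0 40.0 60.0
```

Note that the four-corner form is the only one of the three that can also represent oriented (rotated) bounding boxes; the other two encodings are limited to axis-aligned boxes.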
The TrainingDML-AI conceptual model provides models for the most important elements within TD. These elements have been identified as either required or important in many different AI/ML tasks. However, implementations are not required to support the complete TrainingDML-AI model in order to conform to this Standard. Implementations may employ a subset of constructs according to their specific information needs. For this purpose, modularization is applied to the TrainingDML-AI model.