Publication Date: 2019-12-20

Approval Date: 2019-11-22

Submission Date: 2019-09-30

Reference number of this document: OGC 19-027r2

Reference URL for this document: http://www.opengis.net/doc/PER/t15-D002

Category: OGC Public Engineering Report

Editor: Sam Meek

Title: OGC Testbed-15: Machine Learning Engineering Report


OGC Public Engineering Report

COPYRIGHT

Copyright © 2019 Open Geospatial Consortium. To obtain additional rights of use, visit http://www.opengeospatial.org/

WARNING

This document is not an OGC Standard. This document is an OGC Public Engineering Report created as a deliverable in an OGC Interoperability Initiative and is not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, any OGC Public Engineering Report should not be referenced as required or mandatory technology in procurements. However, the discussions in this document could very well lead to the definition of an OGC Standard.

LICENSE AGREEMENT

Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.

If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.

THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD. THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.

This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.

Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications.

This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.

None of the Intellectual Property or underlying information or technology may be downloaded or otherwise exported or reexported in violation of U.S. export laws and regulations. In addition, you are responsible for complying with any local laws in your jurisdiction which may impact your right to import, export or use the Intellectual Property, and you represent that you have complied with any regulations or registration procedures required by applicable law to make this license enforceable.

1. Subject

The Machine Learning (ML) Engineering Report (ER) documents the results of the ML thread in OGC Testbed-15. This thread explores the ability of ML to interact with and use OGC web standards in the context of natural resources applications. The thread includes five scenarios utilizing seven ML models in a solution architecture that includes implementations of the OGC Web Processing Service (WPS), Web Feature Service (WFS) and Catalogue Service for the Web (CSW) standards. This ER includes thorough investigation and documentation of the experiences of the thread participants.

2. Executive Summary

This OGC ER documents work completed in the OGC Testbed-15 ML thread. This includes documentation of experimental methods and results as well as addressing the integration of ML models and outputs into an OGC Web Services (OWS) architecture. The thread covered several scenarios that have commonalities, but do not interact directly. The purpose of the research in the ML thread was to demonstrate the use of OGC standards in the ML domain through scenario development. The scenarios used in the ML thread were:

  • Petawawa Super Site Research forest change prediction model.

  • New Brunswick forest supply management decision maker ML models.

  • Quebec Lake river differentiation ML models.

  • Richelieu River Hydro linked data harvest models.

  • Arctic web services discovery ML model.

Each scenario has a set of supporting data coupled with cataloging and processing services to support the aim. An ML model is at the core of each scenario. The objective was to have the model make key decisions that a human in the system would typically make under normal circumstances. Each scenario and corresponding implementations were supported by at least one client to demonstrate the execution and parsing of outputs for visualization.

Publication of specific ML results in the draft Map Markup Language (MapML) format focused on the Quebec Lake scenario. The data service for this scenario was an implementation of the OGC Application Programming Interface (API) - Features standard and was required to produce the outputs of the model in MapML. (Note: The OGC API - Features standard was previously named WFS 3.0.) Likewise, the corresponding client, provided as a separate work item, was required to parse and visualize the results using the MapML outputs from the data service. The other scenarios were supported by clients provided by the model originators to demonstrate their work. The MapML work is explored and documented in full in the MapML ER.

Each of the different work activities incorporated one or more ML techniques using different datasets and parameters. The overall findings and recommendations from the ML thread fall into two groups: those regarding ML itself and those concerning the use of OGC standards in ML use cases. Many of the ML recommendations call for further exploration of the techniques required to produce suitable results. Recommendations of interest to the OGC are as follows:

  • Define and discuss the candidate OGC API - Processing pattern for use in machine learning. This type of exercise has already been done in the OGC Open Routing API Pilot in which two different patterns were created to explore the functionality. These were:

    • A lightweight approach that uses routes as the path base, places little constraint on the API design pattern and uses conformance classes to configure clients automatically.

    • A formal structure, based on the OGC API - Processes draft specification, in which paths start with /processes/ and which has many of the same API calls as WPS 2.0.

  • Understand the utility of OGC standards for feeding dynamic data to ML models. As these models require considerable data to train, the thread participants felt that the current suite of OGC standards for data dissemination is better suited for static or mostly static datasets. Extensions specific for data streaming might be useful for all big data problems, not just ML.

  • Explore the use of OGC standards to compare scenarios in previously trained ML models. There are already a number of pre-trained models freely available as well as general feature models that attempt to identify trends, patterns or objects from a variety of domains. Re-use of existing models is likely to be important in the future of geospatial ML applications.

  • Use OGC standards to enable stress testing of ML models. The choice of parameters within ML processes is key to their ability to predict successfully from an unknown sample. Currently this testing is carried out manually. However, it would be useful for OGC standards to support stress testing via the OGC API - Processes draft specification, with the parameters then recorded in a CSW. This approach strays into the realm of metadata profiling for ML models, which may be a useful output of future endeavors that have a discovery aspect.

Overall, the thread produced a multitude of results that can be taken forward in future OGC Testbeds and Pilots or more widely in the community.

2.1. Document contributor contact points

All questions regarding this document should be directed to the editor or the contributors:

Contacts

Name                         Organization   Role
Sam Meek                     Helyx SIS      Editor
Tom Landry                   CRIM           Contributor
Pierre-Luc Saint-Charles     CRIM           Contributor
Francis Charette-Migneault   CRIM           Contributor
Mario Beaulieu               CRIM           Contributor
Ignacio Correas              Skymantics     Contributor
William Cross                Skymantics     Contributor
Jerome St-Louis              Ecere          Contributor

2.2. Foreword

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.

Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.

3. References

4. Terms and definitions

For the purposes of this report, the definitions specified in Clause 4 of the OWS Common Implementation Standard (OGC 06-121r9) shall apply. In addition, the following terms and definitions apply:

● overfitting

The production of an analysis that corresponds too closely or exactly to a particular set of data, and therefore fails to fit additional data or predict future observations reliably. Source: Oxford English Dictionary

● dropout

The procedure of randomly dropping components of a neural network from a neural network layer. This results in a scenario where at each layer more neurons are forced to learn the multiple characteristics of the neural network. This can prevent overfitting. Source: medium.com

● activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. Source: Wikipedia

● hyperparameter

In Bayesian statistics, a hyperparameter is a parameter of a prior distribution. The term is used to distinguish them from parameters of the model for the underlying system under analysis. Source: Wikipedia

4.1. Abbreviated terms

  • ADES - Application Deployment and Execution System

  • AI - Artificial Intelligence

  • API - Application Programming Interface

  • AUPRC - Area Under Precision Recall Curve

  • CLI - Command Line Interface

  • CNN - Convolutional Neural Network

  • CRIM - Computer Research Institute of Montréal

  • CRS - Coordinate Reference System

  • CSW - Catalogue Service for the Web

  • CVM - Controlled Vocabulary Manager

  • CWL - Common Workflow Language

  • DL - Deep Learning

  • DNN - Deep Neural Network

  • DVC - Data Version Control

  • ER - Engineering Report

  • EMS - Execution Management System

  • GRHQ - Géobase du réseau hydrographique du Québec

  • Helyx SIS - Helyx Secure Information Systems Limited

  • HLS - Harmonized Landsat and Sentinel-2

  • HRDEM - High Resolution Digital Elevation Model

  • HTTP - Hypertext Transfer Protocol

  • JSON - JavaScript Object Notation

  • KB - Knowledge Base

  • LiDAR - Light Detection and Ranging

  • mAP - Mean Average Precision

  • MapML - Map Markup Language

  • ML - Machine Learning

  • MLP - Multilayer Perceptron

  • NIR - Near-Infrared

  • NLTK - Natural Language Toolkit

  • OGC - Open Geospatial Consortium

  • ONNX - Open Neural Network Exchange Format

  • OpenMI - Open Modeling Interface

  • OWS - OGC Web Services

  • PaaS - Platform as a Service

  • Pub/Sub - Publication/Subscription

  • RAKE - Rapid Automatic Keyword Extraction

  • RDF - Resource Description Framework

  • REST - Representational State Transfer

  • RGB - Red, Green, Blue

  • RL - Reinforcement Learning

  • RNN - Recurrent Neural Network

  • SPF - Spruce, Pine, Fir

  • SOS - Sensor Observation Service

  • TC - Technical Committee

  • TF/IDF - Term Frequency-Inverse Document Frequency

  • TIE - Technology Integration Experiments

  • URL - Uniform Resource Locator

  • VCS - Version Control Systems

  • WCS - Web Coverage Service

  • WES - Web Enterprise Suite

  • WFS - Web Feature Service

  • WICS - Web Image Classification Service

  • WMS - Web Map Service

  • WPS - Web Processing Service

  • WPS-T - Transactional Web Processing Service

5. Overview

The rest of the ER is organized as follows:

Chapter 6 provides an overview of previous ML work in the OGC and an overview of the work items.

Chapter 7 describes the thread architecture.

Chapter 8 describes the component implementation that provides the Petawawa cloud mosaicking model capability.

Chapter 9 describes the component implementation that provides the Petawawa land classification model capability.

Chapter 10 describes the component implementation that provides the New Brunswick forest supply management decision maker ML model capability.

Chapter 11 describes the Quebec River-Lake Classification and Vectorization ML model capability.

Chapter 12 describes the Arctic Discovery catalog.

Chapter 13 provides the overall discussion and recommendations from the work.

Chapter 14 provides the concluding remarks.

6. Background

This OGC Engineering Report (ER) reports on the work performed and completed as part of the Machine Learning (ML) thread in the OGC Testbed-15 initiative. ML has previously been explored in the OGC through the ML thread in Testbed-14. While the work reported in this ER is not a direct continuation from Testbed-14, the Testbed-14 Machine Learning ER provides many of the recommendations and design influences leading to the work described in this ER. A major driving factor behind this ER is a movement towards standardization of an interface designed for interacting with ML models and processes.

The documents reviewed are largely from the OGC, but academic and industrial references are included where relevant.

Previous OGC work that has influenced the Testbed-15 ML activity consists of the following documents:

  • 18-038r2 - OGC Testbed-14: Machine Learning Engineering Report

  • 18-094r1 - OGC Testbed-14: Characterization of RDF Application Profiles for Simple Linked Data Application and Complex Analytic Applications Engineering Report

  • 18-097 - OGC Environmental Linked Features Interoperability Experiment Engineering Report

  • 18-022r1 - OGC Testbed-14: SWIM Information Registry Engineering Report

  • 18-090r1 - OGC Testbed-14 Federated Clouds Engineering Report

  • 16-059 - Testbed-12 Semantic Portrayal, Registry and Mediation Engineering Report

  • 15-054 - Testbed-11 Implementing Linked Data and Semantically Enabling OGC Services Engineering Report

  • 14-049 - Testbed 10 Cross Community Interoperability (CCI) Ontology Engineering Report

  • 19-003 - OGC Testbed: Earth System Grid Federation (ESGF) Compute Challenge

  • 18-050r1 - OGC Testbed-14: ADES & EMS Results and Best Practices Engineering Report

  • 18-049r1 - OGC Testbed-14: Application Package Engineering Report

  • 17-035 - OGC Testbed-13: Cloud ER

The earliest example of ML-type operations being exposed via an OGC interface is via Web Image Classification Service (WICS). This service includes several calls that are suitable for configuring and executing ML models behind an OGC interface. Specifically, these calls include: GetClassification, TrainClassifier and DescribeClassifier. Although suitable for use in a small set of circumstances, the WICS only supports image-specific calls. It does this through OGC web services style applications that represent an older architecture model, prior to the recent move to a resources-based model through OpenAPI. ML in the Testbed-15 context has broadened to include different types of ML beyond image classification. The work in this Testbed moves towards a decision support tool that utilizes multiple data types to build models and predict results.

The OGC Testbed-14 ML ER describes work that extends beyond WICS. It identifies and implements several new calls that follow a similar pattern to WICS, but go beyond image classification. These calls are as follows:

  • TrainML

  • RetrainML

  • ExecuteML

These three calls follow the web services pattern of OGC services and offer the ability to create, modify and execute ML models through a standardized interface. In addition to these calls are the following ML Knowledge Base (KB) interactions:

  • GetModels

  • GetImages

  • GetFeatures

  • GetMetadata

As well as opening up the interfaces to include ML specific calls, the ML space has undergone semantic enablement via a Controlled Vocabulary Manager (CVM). The interfaces used in Testbed-14 consisted mainly of WPS-T 2.0 with Representational State Transfer (REST) and JavaScript Object Notation (JSON) bindings. There is an ongoing initiative within the OGC Technical Committee (TC) to enhance Web Processing Service (WPS) version 2.0 to 3.0 by implementing the REST/JSON bindings as core functionality rather than as an extension. This is designed to bring WPS in line with the OGC API - Features standard, which is also based on OpenAPI. Additionally, Testbed-14 brought about experimental implementations of Web Map Service (WMS), Web Feature Service (WFS) and WPS standards specifically for transparency and usage of ML models.

There are several explicit recommendations taken from the Testbed-14 ML ER, garnered from experiences of ML in OGC Web Services (OWS). These include:

  • Use of a Catalogue Service for the Web (CSW) as an interface to an ML KB.

  • An International Organization for Standardization (ISO) application profile to record and distribute KB information.

  • Use of the OGC API - Features standard as a design pattern to manage the interaction with ML capabilities.

  • Consideration for use of the Open Modeling Interface (OpenMI) standard.

Moving on from interfaces and service standards, a further area enabling ML discussed in Testbed-14 is the concept of Federated Clouds. Disparate cloud computing services that are federated to share access credentials, and therefore services, data and resources, are likely to play a role in the ML space. Federation of cloud services is not a new concept and is straightforward to manage when cloud services (or any services for that matter) sit within the same administrative domain. Recently there has been a shift in the computing world to assume resources-on-demand, including elastic computational storage and computing power, that can be surged or stood down as required. All of this is usually outsourced in a Platform-as-a-Service (PaaS) approach, where on-demand computing is provided by an organization with significant resources (server farms) that are allocated according to demand and provision. In short, federation enables participating organizations to selectively share information across administrative domains for purposes of their choosing.

The Testbed-14 Federated Clouds ER sought to research and test the implications of utilizing federated cloud architecture, with a focus on cross administrative domain security for use cases such as data sharing. In the case of ML, and in particular the ML thread in this Testbed, components including ML models and clients are designed and maintained by different vendors (as in the real world) but all need to interoperate. Therefore, understanding and applying the lessons learned and recommendations from the Testbed-14 Federated Clouds ER as needed is paramount to a successful set of interoperability experiments.

ML models, outputs and predictions are complex; therefore cataloging, presenting and disseminating data and metadata is of high importance. There are multiple languages, models and standards that can be used to document, discover and disseminate complex offerings utilizing OGC standards. In Testbed-14, dissemination of complex analytic applications and data was explored in the OGC Testbed-14 Characterization of RDF Application Profiles for Simple Linked Data Applications and Complex Analytic Applications ER. The ER covers several aspects of interest including Resource Description Framework (RDF) profiles, Web Ontology Language (OWL) and ontologies to describe certain aspects of complex analytical use cases. RDF is of particular consideration as Testbed-14 sought to define a metadata model to describe RDF application profiles. If operationalized, this would be of tangible use within the ML thread, as it provides a facility to discover application profiles based upon specific ontologies. The work documented in this ER seeks to utilize RDF where suitable to enable discovery of the complex analytical applications.

The architecture of the thread consists of a set of well-defined ML scenarios. The requirements across the thread deliverables are broad enough to cover typical ML usage such as analysis of imagery content through to the discovery of ML datasets, models and practices through the Arctic Discovery catalog. The latter activity is concerned with ML process metadata rather than just the outputs. There is a set of stretch goals within the Call For Participation (CFP) that are also discussed and prioritized according to likely value gain for the sponsors.

In terms of interfaces, each component is fronted by the relevant OGC service. Each of the ML models is fronted by a WPS, either version 1.0 or with the REST/JSON bindings, and the catalog is an OGC-compliant CSW (further work is currently under way in OGC to define an OGC API specification for catalogs). Additionally, data created by the ML models are exposed by the relevant data interface: OGC API - Features for features and Web Coverage Service (WCS) for coverages.

6.1. Relationship to OGC API - Processes (WPS 3)

Note: WPS

Prior to the OGC API - Processes naming convention, the draft specification was referred to as WPS 3.0. The official name of the draft specification is now OGC API - Processes. This ER therefore, at times, acceptably refers to implementations of OGC API - Processes as WPS.

As mentioned previously, there are several commonalities between the scenarios in terms of the requirements. The scenarios are separate as they aim to deliver completely different outputs, are focused on different areas and, in some cases, use different approaches to ML to achieve their goals. However, each of the server-side components is required to be fronted by a WPS (with REST bindings in some instances) and each has the option of utilizing Common Workflow Language (CWL). Note that since the WPS implementations described in this ER conform to the draft OGC API - Processes specification, they are referred to using both terms throughout this document.

At the time of writing, there is a debate within the OGC on how processing services should be exposed using OpenAPI fronted, resource-based architectures. Some of the viewpoints are captured in the OGC API Hackathon 2019 Engineering Report (OGC 19-062) which presents results from the OGC API Hackathon 2019 event. The debate is largely concerned with the role of legacy WPS calls in versions 1, 2, and transactional versions that include:

  • From WPS 1.0

    • GetCapabilities - provide the capabilities document describing the processes available

    • DescribeProcess - describe a particular process

    • Execute - execute a process

  • Introduced in WPS 2.0

    • GetStatus - provide the status of an asynchronous process

    • GetResult - provide the result of an asynchronous process

  • Introduced in WPS-T

    • DeployProcess - deploy a new process ready for Execution

    • UnDeployProcess - undeploy a deployed process so it is no longer available

These calls provide functionality in a web services architecture that performs specific actions in relation to processing. In the resource-based architecture approach, the calls are based upon the HTTP verbs GET, POST, HEAD, PUT and DELETE.
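As a rough illustration of how the legacy operations listed above might map onto a resource-based design, the following Python sketch pairs each WPS operation with an HTTP verb and a path template. The path templates are assumptions loosely modeled on the /processes/ pattern mentioned in this ER and are not quoted from the draft OGC API - Processes specification. The helper name to_http_request and the process identifier cloud-mask are hypothetical.

```python
# Illustrative mapping only: the path templates below are assumptions,
# not quotations from the OGC API - Processes draft specification.
WPS_TO_REST = {
    "GetCapabilities": ("GET", "/processes"),
    "DescribeProcess": ("GET", "/processes/{process_id}"),
    "Execute": ("POST", "/processes/{process_id}/jobs"),
    "GetStatus": ("GET", "/processes/{process_id}/jobs/{job_id}"),
    "GetResult": ("GET", "/processes/{process_id}/jobs/{job_id}/result"),
    "DeployProcess": ("POST", "/processes"),
    "UnDeployProcess": ("DELETE", "/processes/{process_id}"),
}


def to_http_request(operation, **ids):
    """Return the (verb, path) pair for a legacy WPS operation name."""
    verb, template = WPS_TO_REST[operation]
    # Fill in the resource identifiers (hypothetical values in this sketch).
    return verb, template.format(**ids)


if __name__ == "__main__":
    print(to_http_request("DescribeProcess", process_id="cloud-mask"))
    print(to_http_request("GetStatus", process_id="cloud-mask", job_id="42"))
```

In this reading, the operation name disappears from the request and is replaced by the combination of verb and resource path, which is the essence of the resource-based approach.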

6.2. Machine Learning Techniques

The terms "artificial intelligence" and "machine learning" are often used interchangeably or at minimum in a hyphenated fashion. In truth, ML can be considered as a subset of Artificial Intelligence (AI) techniques. Additionally, ML as an array of techniques contains a multitude of different algorithms that are selected to produce the best result depending on the use case. Related to the generic concept of ML is Deep Learning (DL), which is a subset of ML that uses large, multi-layered, artificial neural networks for supervised or unsupervised ML problems.

This section contains a short overview of the techniques used in this thread. While there are several nuanced differences, the main one to consider is the automation of model feedback.

In addition to this functionality, there are several non-functional requirements including:

  • Use of TensorFlow

  • Continuing to work on CWL best practices from previous Testbeds.

  • The demonstrator should be compatible with the Boreal Cloud OpenStack cloud environment of Natural Resources Canada (NRCan). Boreal Cloud is NRCan’s high performance cloud infrastructure based on OpenStack technology, located at the Pacific Forestry Centre in Victoria, BC.

Supervision of a classification application depends on how much human intervention is required to achieve a suitable model for prediction. Supervised Learning requires human intervention to different degrees depending on the use case. Unsupervised Learning does not require any human interaction while training the models as the ML model uses automated techniques to assess the likely performance of the model.

6.2.1. Reinforcement Learning

This type of learning is usually implemented in game play applications and in use cases that include autonomous vehicle navigation, as there is no "correct" answer to a particular problem. Instead, the ML model looks to make the best decision given the circumstances with a view to maximizing cumulative reward. The reinforcement aspect is the application of the reward within the system: if cumulative reward increases, the system has a notion of a good decision and will seek to perform similar actions to further increase reward.
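The idea of seeking actions that increase cumulative reward can be sketched with a very small example. The code below is not taken from any Testbed-15 component; it is a generic epsilon-greedy agent on a multi-armed bandit, with hypothetical reward values, in which the agent mostly repeats the action with the best observed average reward and occasionally explores.

```python
import random


def run_bandit(arm_means, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit.

    Returns (estimates, total_reward), where estimates[i] is the running
    average of the rewards observed for arm i.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_means)
    counts = [0] * len(arm_means)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploit: repeat the action that has paid off best so far.
            arm = max(range(len(arm_means)), key=lambda i: estimates[i])
        # Hypothetical environment: fixed mean reward plus small noise.
        reward = arm_means[arm] + rng.uniform(-0.05, 0.05)
        counts[arm] += 1
        # Incremental update of the running average for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward


if __name__ == "__main__":
    estimates, total = run_bandit([0.2, 0.5, 0.9])
    print("estimated rewards:", estimates)
    print("cumulative reward:", total)
```

No action is labeled as correct in advance; the agent only observes rewards and shifts its behavior towards the actions that have increased them.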

6.2.2. Convolutional Neural Networks

A Convolutional Neural Network (CNN) uses convolutions to extract features from local regions of an input. CNNs have gained popularity particularly through their excellent performance on visual recognition tasks. CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.
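To make the convolution operation concrete, the following sketch slides a small filter over a toy image in plain Python. In a CNN the filter weights are learned during training; here they are set by hand to a vertical-edge detector purely for illustration, and the function name conv2d_valid is hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1.

    As in most deep learning libraries, the kernel is not flipped, so
    strictly speaking this computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[di][dj] * image[i + di][j + dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4x4 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-set filter that responds to a dark-to-bright transition.
kernel = [[-1, 1], [-1, 1]]

if __name__ == "__main__":
    for row in conv2d_valid(image, kernel):
        print(row)
```

The output is non-zero only in the column where the filter straddles the edge, which is the local feature extraction described above.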

6.2.3. Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows the model to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
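The role of the internal state can be illustrated with a single recurrent unit. The sketch below is a minimal example with hand-picked scalar weights (an assumption for illustration, not a trained model): at each step the new state depends on both the current input and the previous state, so the same inputs presented in a different order give a different result.

```python
import math


def rnn_final_state(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run one scalar recurrent unit over a sequence and return its final state.

    Update rule: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias), with h_0 = 0.
    """
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + bias)
    return h


if __name__ == "__main__":
    print(rnn_final_state([1.0, 0.0]))
    print(rnn_final_state([0.0, 1.0]))
```

A feedforward unit applied to each input independently could not distinguish these two sequences; the recurrent state is what carries the ordering information.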

7. Thread Architecture

The ML thread comprises a set of five scenarios with eight formal deliverables and several clients provided by vendors in kind. As there are five separate scenarios, the thread participants defined five separate architectures that were utilized to demonstrate the interoperability of components through the interoperability testing process, also known as Technology Integration Experiments (TIEs). This section describes the scenarios and supporting architectures in detail to provide the reader with an overview of the thread goals, the architecture and motivation for each of the scenarios, and any changes made during the course of the Testbed to mitigate issues experienced. The five scenarios are as follows:

  • Petawawa Super Site Research Forest Change Prediction ML Model

  • New Brunswick forest supply management decision maker ML model

  • Quebec lake-river differentiation model

  • Richelieu River linked data harvest model

  • Arctic web services discovery ML models

There is a common set of technical requirements for each of the scenarios. As with all Testbeds, one of the goals is to utilize the latest versions of OGC standards. This policy was operationalized in this domain via usage of the OGC API - Processes draft specification. Likewise, data services are made available through OGC API - Features implementations using the new OpenAPI style resource-based approach and WMS/WCS for mapping and coverages. Although not an enforced requirement, it is expected that each of the ML models is built using open source software, with TensorFlow mentioned specifically. Many of the ML participants opted to use WPS 2.0 because they had existing operational implementations. Each of these scenarios is discussed in turn in the following sections.

7.1. Petawawa Super Site research forest change prediction ML model scenario

Figure 1. Petawawa Super Site forest change architecture

The aim of this component deliverable was to 1) produce an ML model for detecting and removing high altitude cloudlets (popcorn clouds) from Landsat 1 data in the Petawawa Super Site, and 2) produce a second model for classifying a cloudless, automatically generated image mosaic into land cover categories. The ML model performed the following functions:

  1. Data discovery using an OGC CSW.

  2. Discovery of usable imagery that has less than 70% cloud coverage.

  3. Identification of parts of an image that are either cloud or cloud shadow.

  4. Creation of a cloud free composite image using automated techniques.

  5. Classification of the resultant composite image into land cover using a second ML model.

In addition to this functionality, there are several non-functional requirements including:

  • Use of TensorFlow

  • Continuing to work on CWL best practices from previous Testbeds. CWL could potentially be used to automate some of the test workflows or to enable the discovery to dissemination aspect of the system. Although a non-functional requirement, the implementation aspect is optional.

  • Ensuring demonstrator compatibility with the NRCan Boreal Cloud OpenStack environment.

This scenario is designed to exercise two ML models that are made available to a single client. These include:

  • A cloud and cloud shadow (artifact) identification model.

  • A land cover classification model.

These two ML models form the backbone of the ML thread; however, they are supported by the following services:

  • Each ML model is fronted by a WPS 2.0 for simple execution of the services exposed by the models.

  • A CSW facilitates discovery of time-series enabled satellite imagery from Landsat and Sentinel-2 products.

  • A mosaicking component, attached to the cloud artifact identification model, uses multiple images to build a cloud-free composite.

  • Results are made available via the relevant interface (WFS 3.0, WMS, WCS).

Additionally, there is a requirement to continue the work done in Testbed-14 to utilize the CWL to potentially automate some test workflows or to enable the discovery → ML1 → ML2 → dissemination aspect of the system via pre-configuration. The CWL aspects of the thread are optional and implemented where specified.

7.2. New Brunswick forest supply management decision maker ML model scenario

Figure 2. New Brunswick forest supply management scenario architecture

This ML model was concerned with the efficient routing of timber from a managed woodland area in New Brunswick. Road building and infrastructure management were also considered. This model was atypical in terms of its usage of ML practices. It performed the following functions while working towards similar non-functional requirements as described in the previous section:

  1. Create a "wood flow model" to optimize routing for timber from source to market.

  2. Recommend areas for new road construction to make the route more efficient.

  3. Provide a list of recommended road closure locations and times to minimize disruption.

  4. Consider data from different sources including: primary infrastructure, secondary infrastructure, and prices of lumber, fuel and energy.

As mentioned previously, the scenarios in this thread were distinct and therefore treated as their own work-item sets, rather than one large, interoperable thread. The New Brunswick scenario contains many of the same constraints and requirements as the other scenarios, such as using a WPS instance to front the model, a CSW for data discovery and cataloging, and WCS/WFS/WMS for data dissemination. The ML model in this work package was complex and consisted of a set of ML models to achieve the desired outcome. The ML model aspects of this work package were as follows:

  • Creation of a wood flow model, that is, optimization of resource allocation considering optimized flow from forest to market.

  • Recommendation of new infrastructure including roads and bridges to further optimize wood flow considering life-cycle analysis.

  • Utilization of peripheral supporting information including market prices of lumber, secondary infrastructure, primary infrastructure and efficiency.

  • Deployment of the capability on the NRCan Boreal Cloud OpenStack environment.

Execution of the workflow is somewhat simpler than in the Petawawa scenario, as the ML service can be configured and executed without reaching back to the client at any point, except when providing the result.

7.3. Quebec Lake river differentiation ML model scenario

Figure 3. Quebec Lake river differentiation model architecture

The objective of this work package was to create and deploy an ML model to differentiate between rivers and lakes from otherwise unlabeled bodies of water in an image. The main focus of the work was to provide a service to determine whether a body of water should be split into a lake and a river. If so, then the lake and river portions of the split should be identified and labeled. If no split is required, then each identified body of water should be labeled as either lake or river. The procedure for applying the model is as follows:

  1. Recommend whether a water body should be split into lake and river features.

  2. Evaluate the confidence level of a recommendation.

  3. Apply the recommendation to the dataset.

  4. Test and correct the resultant dataset for topological and cartographical issues.

  5. Present the data in a WFS 3.0 using MapML (described in another ER).

This scenario requires an ML model that is capable of differentiating between lakes and rivers from imagery and Light Detection and Ranging (LiDAR) data. Currently, bodies of water can be distinguished in these datasets, but there is no clear indication of where a water body changes from a lake to a river and vice versa. This is not just an ML problem but also an ontological problem. Any definition of the two concepts is therefore somewhat arbitrary, although it must be applied consistently if an ML service is to be successful. In addition to identifying rivers and lakes, the entire ML service needed to perform the following functions in a workflow:

  • Identify a water body and recommend whether a split needs to be made and apply a confidence level to the recommendation.

  • If a split is made then vectorize the bodies of water into lakes and rivers.

  • Apply topological correction algorithms if required to remove errors including:

    • Overlaps

    • Slivers

    • Gaps

  • Name each feature according to a suitable naming convention as not all rivers and lakes have accessible names.

  • Serve the results via MapML using WFS 3.0.

Unlike the previous work packages, there is no data discovery requirement via a CSW. However, there is a requirement to serve the results via an implementation of OGC API – Features, using MapML.

7.4. Richelieu River hydro linked data harvest model scenario

Figure 4. Richelieu River linked data harvesting scenario architecture

This work package differs from the others as it does not require imagery or ML in the traditional sense. Instead this scenario seeks to mine the semantic web for relevant relations between datasets and store the results as triples in the appropriate database. The model was based upon a set of provided ontologies for features and relations to be harvested by the ML model. This scenario was concerned with establishing links between datasets via the semantic web. The main work item in this work package was the AI tool triple generator, which sought to harvest data from specific datasets and gather relations between items of data. The details regarding the semantic aspects of this work package are described in the OGC Testbed-15: Semantic Web Link Builder and Triple Generator Engineering Report (OGC 19-021) and the ML aspects are described in the Components section of this ER.

7.5. Arctic Web Services Discovery ML model scenario

Figure 5. Arctic Web Service discovery model ML architecture

The goal for this work package was to understand the data holdings of a particular domain and its utility to the Arctic domain in terms of relevance to circumpolar science. The following structure was used for this approach:

  • The model was focused on the .ca domain to understand the assets that are available within this domain and their relationship to other data assets.

  • The ML model was trained to cycle through and categorize endpoints on the .ca domain and make a decision on whether each has any relevance to circumpolar science.

  • The identified datasets were given a confidence score and then entered into a CSW for later discovery and use.

The concept of relevance can be determined in a variety of ways. For example, a geographical bounding box can be used as a geofence but the model may also rely on keyword search as well as other parameterization options. Essentially the ML aspect of the service was trained on a set of attributes of a test ML service that was deemed to be relevant. It then crawled through all ESRI REST endpoints and OGC services within the domain and made an assessment of each of the services, providing information on their relevance.

8. Petawawa cloud mosaicking ML model

In the context of the Petawawa Super Site research forest change prediction ML model, the Testbed-15 D100 component (i.e. cloud mosaicking ML model) aimed to create a cloud-free mosaic over the Petawawa Research Forest by assembling the best non-cloud and most recent segments over a given time frame. The cloud detection system was based on ML and CNN.

Figure 6. Petawawa Super Site research forest change prediction ML model
Note
Petawawa Research Forest

The 100 km² Petawawa Research Forest is situated in Ontario, approximately two hours northwest of Canada’s capital city, Ottawa. Located in the mixedwood forests of the Great Lakes–St. Lawrence Forest region, common tree species include white pine (Pinus strobus L.), trembling aspen (Populus tremuloides Michx.), red oak (Quercus rubra L.), red pine (P. resinosa Ait.), white birch (Betula papyrifera), maple (Acer spp.), and white spruce (Picea glauca), among others (Wetzel et al. 2011). This forest region is considered a transition between the boreal forests to the north, which are dominated by coniferous species, and the deciduous-dominated forests to the south.

The whole system was developed and deployed to be compatible with NRCan’s Boreal Cloud (OpenStack cloud environment). The model can be accessed via a generic WPS client. The mosaic is generated starting from surface reflectance products available from NRCan’s National Forest Information System for the following datasets:

Table 1. Datasets
Dataset | Description
Landsat | Archived Landsat Collection 1 data (1972–2018). Includes Landsat Multispectral Scanner (MSS), Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and the Operational Land Imager (OLI). With the exception of MSS, all data is corrected to surface reflectance. Search terms: “PRF” AND “Landsat”, “Landsat4”, “Landsat5”, “MSS”, “TM”, “ETM+”, “OLI”, etc.
Sentinel-2 | Archived Sentinel-2 data (2016–2018), corrected to surface reflectance.
Harmonized Landsat and Sentinel-2 (HLS) | Harmonized Landsat and Sentinel-2 surface reflectance data generated by NASA/USGS (2013–2018)

Warning
Landsat availability

With respect to the Landsat Dataset defined in Table 1, for this component the search is based only on Landsat 7 ETM+. In any case, the MSS products cannot be used due to the missing Blue band.

8.1. Component Summary

The following figure summarizes the main software constituting the component.

Figure 7. D100 High level architecture

The D100 component design is based on four main elements:

  • WPS Server

  • Job / Queue Handler

  • Internal Storage

  • Orchestrator

Warning
Architecture Deltas

At the time of publication of this ER, the D100 component is based on WPS version 1.0 and not WPS 3.0 as stated in the architecture. This was a stretch goal for this work package. Likewise, the implementation does not use CWL.

8.1.1. WPS Server

The D100 component exposes a dedicated WPS-enabled server (implementing the WPS 1.0 standard) for the requests. The WPS server is implemented with PyWPS and runs on Flask, a lightweight Python Web Server Gateway Interface (WSGI) web application framework. The endpoint handles the following four different requests:

  • Network training;

  • Cloud free mosaic generation;

  • Cloud free mosaic generation status query;

  • Cloud free mosaic download.

Because both network training and mosaic generation are demanding activities, these types of requests are queued in order to avoid blocking the WPS server. All other requests are served immediately. For cloud-free mosaic generation, the client can select either one of the two default, ready-to-use networks (one trained with 3 bands for Red-Green-Blue (RGB) and one with 4 bands for RGB + Near Infrared) or a new network (either 3 or 4 bands) trained through a dedicated WPS request.

8.1.2. Job / Queue Handler

The queue mechanism relies on Remote Dictionary Server (Redis), an in-memory data structure store that supports different kinds of abstract data structures. Every time the WPS server receives a training or mosaic generation request, the new request is pushed onto a queue managed by Redis Queue (RQ), a Python library for queuing jobs and processing them in the background with so-called "workers". The job is not run immediately; its execution is delegated to RQ. A worker is a separate Python process running in the background as a work-horse to perform lengthy or blocking tasks, so that these tasks are not performed inside a web process. At least one worker is always up and running, and more workers can be started as needed to serve several requests at once, depending on application load and the hardware resources available (the default configuration is 5 workers). Every time a worker is available, the next job is retrieved from the queue in First In - First Out (FIFO) order and executed in a new dedicated process.
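The FIFO behavior described above can be illustrated with a small sketch. It is not the D100 implementation: it uses only the Python standard library (`queue` and `threading`) as a stand-in for Redis and RQ, and the function name `run_jobs` and the returned status strings are assumptions made for illustration.

```python
import queue
import threading


def run_jobs(jobs, num_workers=5):
    """Run (job_id, callable) pairs in FIFO order on background workers.

    Stand-in for RQ: a real deployment keeps the queue in Redis so that
    workers can live in separate processes from the web server.
    """
    pending = queue.Queue()  # queue.Queue hands items out first in, first out
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this worker
                pending.task_done()
                break
            job_id, func = item
            try:
                outcome = ("completed", func())
            except Exception as exc:  # a failing job must not kill the worker
                outcome = ("failed", repr(exc))
            with lock:
                results[job_id] = outcome
            pending.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()  # block until every queued item has been processed
    return results
```

The web process only enqueues work and returns; the lengthy task runs in the worker, which is the property the WPS server relies on to stay responsive.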

8.1.3. Internal Storage

The Internal Storage component contains two different kinds of data: Trained network models and, inside what is called the Workspace, all downloaded bands, tiles, intermediate cloud masks generated by the ML model, and the final mosaic for each request. By default, two models are present (as stated before, one trained with 3 bands for RGB and one with 4 bands for RGB + Near Infrared). Any custom model trained through a dedicated WPS call is also stored for later use. All the models (stored as PyTorch checkpoints) are persistent in time. The Workspace content, instead, is preserved for each job only for a specific retention time. When this time expires, the job's folder is deleted to save storage.
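The retention rule for the Workspace can be sketched as follows. This is a hedged illustration only: the source does not state how expiry is detected, so the one-folder-per-job layout, the use of the folder modification time, and the name `purge_expired` are all assumptions.

```python
import shutil
import time
from pathlib import Path


def purge_expired(workspace_root, retention_seconds, now=None):
    """Delete job folders older than the retention time; return their names.

    Assumes one sub-folder per job and uses the folder's modification
    time as the job's completion time (an assumption, not the D100 rule).
    """
    now = time.time() if now is None else now
    removed = []
    for job_dir in Path(workspace_root).iterdir():
        if not job_dir.is_dir():
            continue  # ignore stray files in the workspace root
        age = now - job_dir.stat().st_mtime
        if age > retention_seconds:
            shutil.rmtree(job_dir)
            removed.append(job_dir.name)
    return sorted(removed)
```

Trained models would live outside `workspace_root`, so a purge of this kind never touches them.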

8.1.4. Orchestrator

When a WPS request is received for cloud mosaicking, the worker runs a dedicated job named Orchestrator. This was the core part of the D100 component and was composed of several different subcomponents as follows:

  • OGC Clients

  • Bands Slicer

  • ML Model

  • Mosaic Builder

8.1.4.1. OGC Clients

Access to the catalog is required to create the mosaic. In order to search for products and retrieve their bands, the Orchestrator uses the OWSLib Python library for both CSW and WCS requests. For each product discovered through the catalog service, several links are returned, one for each available band. The number of bands downloaded depends on the model requested for the cloud detection (i.e. either 3 (RGB) or 4 (RGB + Near Infrared) bands).

Table 2. Bands Number Mapping
Dataset | Red | Green | Blue | NIR | Resolution (m)
Sentinel-2 | 4 | 3 | 2 | 8 | 10
Landsat-7 | 3 | 2 | 1 | 4 | 30
HLS | 4 | 3 | 2 | 5 | 30

For each WPS mosaic generation request, a dedicated folder is created inside the Workspace of the Internal Storage. For each product found in the search, the relevant bands are downloaded via WCS and stored in that folder. The ML model is called only after all the bands have been downloaded and processed by the Bands Slicer.
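The band selection implied by Table 2 can be written as a simple lookup. The band numbers below are copied from Table 2 and the model names RGB and RGBNIR from the request parameters described later in this section; the dictionary layout and the function name `bands_to_download` are illustrative assumptions.

```python
# Band numbers per dataset, as listed in Table 2.
BAND_NUMBERS = {
    "Sentinel-2": {"red": 4, "green": 3, "blue": 2, "nir": 8},
    "Landsat-7": {"red": 3, "green": 2, "blue": 1, "nir": 4},
    "HLS": {"red": 4, "green": 3, "blue": 2, "nir": 5},
}

# Channels expected by each default model.
MODEL_BANDS = {
    "RGB": ("red", "green", "blue"),
    "RGBNIR": ("red", "green", "blue", "nir"),
}


def bands_to_download(dataset, model):
    """Return the band numbers to fetch for a dataset and model type."""
    mapping = BAND_NUMBERS[dataset]
    return [mapping[name] for name in MODEL_BANDS[model]]
```

For example, `bands_to_download("Sentinel-2", "RGBNIR")` yields bands 4, 3, 2 and 8, while the RGB model needs only the first three.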

8.1.4.2. Bands Slicer

In order to provide the ML model with proper input, all the downloaded bands are sliced into tiles of 224 x 224 pixels and marked with proper geolocation / geographic information (needed later to rebuild a single cloud mask image). This tile size was chosen to balance speed and performance in the training phase.
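A minimal sketch of the slicing step is given below. It only computes the pixel windows of the tiles, keeping the offsets that are later needed to reassemble the cloud mask. How the D100 component treats partial tiles at the right and bottom edges is not stated in this report; here they are simply returned smaller than 224 x 224, which is an assumption (padding would be needed before inference).

```python
TILE_SIZE = 224  # tile edge length in pixels, as stated in the text


def tile_windows(width, height, tile=TILE_SIZE):
    """Yield (col_off, row_off, tile_width, tile_height) for each tile.

    The offsets locate the tile inside the full band so that per-tile
    cloud masks can be placed back at the correct position.
    """
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            yield (
                col_off,
                row_off,
                min(tile, width - col_off),   # clipped at the right edge
                min(tile, height - row_off),  # clipped at the bottom edge
            )
```

Applied to every downloaded band with the same band dimensions, the windows line up across bands, so a 3 or 4 band tile can be stacked from identical pixel ranges.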

8.1.4.3. ML Models

The generic ML model is based on a ResNet 18 architecture developed on the PyTorch framework. The model accepts one single tile (224 x 224 pixels) composed of several bands (three or four) and generates a black and white image representing the cloud mask of the inferred data with the same size. The pure white areas represent pixels containing clouds while the black areas represent pixels where clouds are not present. Two default models were made available: One trained with three bands (RGB) and one with four bands (RGB + NIR).

8.1.4.4. Mosaic Builder

When all the tiles are processed by the ML model, the Mosaic Builder merges them into a single cloud mask. The cloud mask is used as an alpha channel to be applied to the original product bands. This result is then combined with the other cloud-free mosaics in a reverse time order. This allows cloud pixels from earlier images to be replaced by non-cloud pixels from more recent images. The final mosaic is generated in GeoTIFF RGB format.
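The stacking logic can be sketched with plain nested lists. This is an illustration under assumptions rather than the Mosaic Builder itself: masks are modeled as booleans (True for cloud) instead of a black and white alpha image, all layers are assumed to share the same grid, and the name `composite` is hypothetical.

```python
def composite(layers):
    """Merge (pixels, cloud_mask) layers, highest priority first.

    pixels and cloud_mask are equally sized 2D lists; a True mask value
    marks a cloud pixel. Each output cell takes the first non-cloud
    value found, or None if every layer is cloudy at that position.
    """
    rows = len(layers[0][0])
    cols = len(layers[0][0][0])
    mosaic = [[None] * cols for _ in range(rows)]
    for pixels, mask in layers:
        for r in range(rows):
            for c in range(cols):
                if mosaic[r][c] is None and not mask[r][c]:
                    mosaic[r][c] = pixels[r][c]
    return mosaic
```

A cell that is still None after all layers have been merged corresponds to an area that remains cloudy, which is the condition under which the component would look for further products.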

8.2. Component Design

This section describes the overall lifecycle of the D100 component considering two main use cases covering all the functionalities:

  • Cloud-free mosaic generation;

  • ML model training / retraining.

8.2.1. Cloud free mosaic generation

The mosaic generation was triggered by a specific WPS request. The following figure shows the sequence diagram for a generic mosaic generation process.

Figure 8. Cloud Free Mosaic Sequence Diagram

The D100 component receives a WPS execution request containing several input parameters (e.g. time window, ML model to be used and so on). The request is queued and waits for the first available worker to run the job. The client is notified with a response message indicating that the request was received and a new job was created with a specific Job ID. This ID is later used by the client to query the status of the request’s progress.

Figure 9. Job Status Diagram

Any subsequent request about the Job status will return one of the following statuses:

  • queued: The WPS request for a mosaic generation is received and queued but has not yet started.

  • started: The Job has been picked up by a worker and is running.

  • failed: The Job has encountered an unexpected error and is blocked.

  • completed: The mosaic has been generated and is available for download.

  • NONE: Either the Job ID is not valid or is no longer available (retention time expired).
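A client-side polling loop over these statuses might look like the sketch below. The `get_status` callable is a hypothetical placeholder for whatever issues the status request and returns the status string. The set of terminal values also includes finished, because the sample status response later in this section reports that value; treating it as equivalent to completed is an assumption.

```python
import time

# Statuses after which polling should stop. "finished" is included as an
# assumed synonym of "completed" (see the sample status response).
TERMINAL_STATUSES = {"completed", "finished", "failed", "NONE"}


def wait_for_job(get_status, job_id, interval=5.0, max_polls=120,
                 sleep=time.sleep):
    """Poll get_status(job_id) until a terminal status is returned."""
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)  # still queued or started: wait and try again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `sleep` parameter is injectable only so that the loop can be exercised without real delays.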

As soon as a worker is available, it dequeues a Job and runs the Orchestrator. As a first step, the Orchestrator queries the CSW server to retrieve all products covering the requested time window. The result list is then sorted in descending percentage coverage order, and all the products having cloud coverage greater than 70% are discarded. For each remaining product, the relevant bands are downloaded, sliced into tiles, passed through the ML model for inference, and reassembled to generate a single cloud mask for the whole product. The Mosaic Builder then takes the product bands plus the cloud mask and merges one product at a time, stacking the different results according to the descending sort order. This process, from downloading the bands to image stacking, is performed iteratively (i.e. handling one product at a time) until either the area is entirely cloud free or no more products are available. Finally, the resultant GeoTIFF RGB file is downloaded by the client.

8.2.2. ML model training

The D100 implementation was delivered using two different networks having the same architecture but handling tiles defined by either 3 (RGB) or 4 (RGB + NIR) bands. This approach is consistent with the two process profiles described in the OGC Testbed-14 Machine Learning ER and the ML best practices work from Testbed-14. This means that training or retraining of a network can be triggered by a dedicated WPS call. Considering the nature and context of the cloud detection system, supporting a dynamic dataset (mainly to validate the quality and accuracy of the network) is quite complex. Instead, a client can request the training of a new instance of the network, choosing the 3 or 4 band architecture and specifying the batch size and number of epochs. The new model is then stored in the Internal Storage and can be recalled later for the generation of a cloud free mosaic.

8.3. Implementation Approach

8.3.1. Job / Queue Handler

In order to handle all the WPS execute calls, at least one running worker shall be present. Run the following command line to start a new worker.

RQ Worker start-up command
prompt> rq worker --worker-ttl -1

In order to ensure that each queued job is served, the parameter --worker-ttl is set to -1 to disable expiration of the job.

Reviewing the status of the workers (i.e. how many are running and their queue status) can be achieved with the rqinfo command.

RQ Worker status command
prompt> rqinfo

default      | 0
1 queues, 0 jobs total

a24018a5cc594e9cb73779d5a8908afa (None None): ?
dde85a5880574f55853107e0899fa669 (None None): ?
db90b2fbb3e64b428d324766dff52ea0 (None None): ?
6833fe6f60fa4fe88426ca7aba88429a (None None): ?
3ee415e050a34403b4140b2d34b83f67 (None None): ?
5 workers, 1 queues

To stop the workers either kill the processes or close the prompt.

8.3.2. WPS Server

As described above, the WPS server is based on Flask and handles four different types of requests: Network training, cloud free mosaic generation, generation status query, and cloud free mosaic download. Besides these, the WPS instance exposes the generic WPS interface, such as the standard GetCapabilities operation.

8.3.2.1. GetCapabilities

The GetCapabilities operation requests details of the services offered by the D100 component, including service metadata and metadata describing the available processes. The response is an XML document called the capabilities document, which contains a list of all available services. An example of a GetCapabilities request is:

https://borealweb.nfis.org/tb15d100wps?
  service=WPS&
  version=1.0.0&
  request=GetCapabilities

The response is a standard WPS GetCapabilities XML response. The following is a snippet of the services offered by the D100 component:

GetCapabilities XML sample response snippet
<!-- PyWPS 4.2.1 -->
<wps:Capabilities service="WPS" version="1.0.0" xml:lang="en-CA" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsGetCapabilities_response.xsd" updateSequence="1">
  ...
  ...
  <wps:ProcessOfferings>
    <wps:Process wps:processVersion="1.0.0">
      <ows:Identifier>train_network</ows:Identifier>
      <ows:Title>Train Network</ows:Title>
      <ows:Abstract>Trigger a process to train the neural network</ows:Abstract>
    </wps:Process>
    <wps:Process wps:processVersion="1.0.0">
      <ows:Identifier>compose_mosaic</ows:Identifier>
      <ows:Title>Compose Mosaic</ows:Title>
      <ows:Abstract>Trigger a process to compose a mosaic over a time range</ows:Abstract>
    </wps:Process>
    <wps:Process wps:processVersion="1.0.0">
      <ows:Identifier>get_status</ows:Identifier>
      <ows:Title>Get Status</ows:Title>
      <ows:Abstract>Retrieve the Job Status</ows:Abstract>
    </wps:Process>
    <wps:Process wps:processVersion="1.0.0">
      <ows:Identifier>get_result</ows:Identifier>
      <ows:Title>Get Result</ows:Title>
      <ows:Abstract>Retrieve the Job Result</ows:Abstract>
    </wps:Process>
  </wps:ProcessOfferings>
  ...
  ...
</wps:Capabilities>

8.3.2.2. DescribeProcess

The DescribeProcess operation requests details of any services offered by the D100 component.

An example of a DescribeProcess request for a compose_mosaic service is:

https://borealweb.nfis.org/tb15d100wps?
  service=WPS&
  version=1.0.0&
  request=DescribeProcess&
  identifier=compose_mosaic

The response is a standard DescribeProcess document that lists all the available parameters, their types and, where constrained, their allowed values (e.g. the model type to train or the model name to use for inference). The following sections describe all the available services and their parameters.
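The key-value-pair requests shown above, and the Execute requests in the following sections, share the same structure, so they can be assembled by one small helper. This is a sketch using only the standard library; the helper name `wps_url` is an assumption, and the endpoint is the one quoted in the examples. The `=` and `;` characters inside `datainputs` are left unescaped to match the request examples in this report.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the request examples of this report.
ENDPOINT = "https://borealweb.nfis.org/tb15d100wps"


def wps_url(request, identifier=None, datainputs=None, endpoint=ENDPOINT):
    """Build a WPS 1.0.0 KVP request URL.

    datainputs is an optional dict that is serialized as
    key1=value1;key2=value2, the form used by WPS 1.0.0 Execute requests.
    """
    params = [("service", "WPS"), ("version", "1.0.0"), ("request", request)]
    if identifier is not None:
        params.append(("identifier", identifier))
    if datainputs:
        joined = ";".join(f"{key}={value}" for key, value in datainputs.items())
        params.append(("datainputs", joined))
    return endpoint + "?" + urlencode(params, safe="=;")
```

For instance, `wps_url("DescribeProcess", "compose_mosaic")` reproduces the DescribeProcess request above on a single line.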

8.3.2.3. Cloud free mosaic generation

In order to trigger the generation of a new cloud free mosaic, a specific WPS execute service is exposed with the following parameters:

Table 3. Cloud free mosaic generation request parameters
Keyword | Description | Sample Value
identifier | The name of the action to be executed. Fixed value: compose_mosaic | compose_mosaic
model | The name of the network to be used. The pretrained default networks RGB and RGBNIR are always available. | RGBNIR
start | The start date of the time window in ISO date format. | 2017-06-01
stop | The stop date of the time window in ISO date format. | 2017-07-01

An example of this request is:

https://borealweb.nfis.org/tb15d100wps?
  service=WPS&
  version=1.0.0&
  request=Execute&
  identifier=compose_mosaic&
  datainputs=model=RGBNIR;start=2017-06-01;end=2017-07-01

If the request is accepted and queued correctly, the client is provided with the Job ID (e.g. d7111976-3a9f-401c-a3d2-c1a5c30329ac) uniquely identifying the request. This Job ID is needed to perform a status query.

Cloud free mosaic generation XML sample response
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="https://borealweb.nfis.org/tb15d100wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="">
    <wps:Process wps:processVersion="1.0.0">
        <ows:Identifier>compose_mosaic</ows:Identifier>
        <ows:Title>Compose Mosaic</ows:Title>
        <ows:Abstract>Trigger a process to compose a mosaic over a time range</ows:Abstract>
    </wps:Process>
    <wps:Status creationTime="2019-08-21T15:34:49Z">
        <wps:ProcessSucceeded>PyWPS Process Compose Mosaic finished</wps:ProcessSucceeded>
    </wps:Status>
    <wps:ProcessOutputs>
        <wps:Output>
            <ows:Identifier>jobID</ows:Identifier>
            <ows:Title>Job Identifier</ows:Title>
            <ows:Abstract></ows:Abstract>
            <wps:Data>
                <wps:LiteralData uom="urn:ogc:def:uom:OGC:1.0:unity" dataType="string">d7111976-3a9f-401c-a3d2-c1a5c30329ac</wps:LiteralData>
            </wps:Data>
        </wps:Output>
    </wps:ProcessOutputs>
</wps:ExecuteResponse>
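
A client needs to extract the Job ID from a response of this kind. The sketch below uses the standard library XML parser and the namespaces declared in the sample response; the function name `literal_outputs` is an assumption, and only literal outputs are handled.

```python
import xml.etree.ElementTree as ET

# Namespaces as declared in the sample ExecuteResponse.
NS = {
    "wps": "http://www.opengis.net/wps/1.0.0",
    "ows": "http://www.opengis.net/ows/1.1",
}


def literal_outputs(xml_text):
    """Return {output identifier: literal value} from a WPS 1.0.0 ExecuteResponse."""
    root = ET.fromstring(xml_text)
    outputs = {}
    for output in root.findall("wps:ProcessOutputs/wps:Output", NS):
        identifier = output.findtext("ows:Identifier", namespaces=NS)
        value = output.findtext("wps:Data/wps:LiteralData", namespaces=NS)
        outputs[identifier] = value
    return outputs
```

Applied to the sample response above, the returned dictionary would hold the Job ID under the `jobID` key. The same function can be reused for the status and result responses, which follow the same layout.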

8.3.3. Cloud free mosaic generation status query

In order to query the system about the status of a Job, a specific WPS execute service is exposed with the following parameters.

Table 4. Cloud free mosaic generation status query request parameters
Keyword | Description | Sample Value
identifier | The name of the action to be executed. Fixed value: get_status | get_status
job_id | The Job ID returned in the cloud free mosaic generation response. | d7111976-3a9f-401c-a3d2-c1a5c30329ac

An example of this request is:

https://borealweb.nfis.org/tb15d100wps?
  service=WPS&
  version=1.0.0&
  request=Execute&
  identifier=get_status&
  datainputs=job_id=d7111976-3a9f-401c-a3d2-c1a5c30329ac

The status of the job follows the flow defined in Figure 9.

Cloud free mosaic generation status query XML sample response
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="https://borealweb.nfis.org/tb15d100wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="">
    <wps:Process wps:processVersion="1.0.0">
        <ows:Identifier>get_status</ows:Identifier>
        <ows:Title>Get Status</ows:Title>
        <ows:Abstract>Retrieve the Job Status</ows:Abstract>
    </wps:Process>
    <wps:Status creationTime="2019-08-21T15:46:22Z">
        <wps:ProcessSucceeded>PyWPS Process Get Status finished</wps:ProcessSucceeded>
    </wps:Status>
    <wps:ProcessOutputs>
        <wps:Output>
            <ows:Identifier>status</ows:Identifier>
            <ows:Title>Job Status</ows:Title>
            <ows:Abstract></ows:Abstract>
            <wps:Data>
                <wps:LiteralData uom="urn:ogc:def:uom:OGC:1.0:unity" dataType="string">finished</wps:LiteralData>
            </wps:Data>
        </wps:Output>
    </wps:ProcessOutputs>
</wps:ExecuteResponse>

8.3.4. Cloud free mosaic download

When the status query indicates that processing has completed for the required Job ID, the URL for downloading the generated mosaic can be retrieved. A specific WPS execute service is exposed with the following parameters:

Table 5. Cloud free mosaic download request parameters
Keyword | Description | Sample Value
identifier | The name of the action to be executed. Fixed value: get_result | get_result
job_id | The Job ID returned in the cloud free mosaic generation response. | d7111976-3a9f-401c-a3d2-c1a5c30329ac

An example of this request is:

https://borealweb.nfis.org/tb15d100wps?
  service=WPS&
  version=1.0.0&
  request=Execute&
  identifier=get_result&
  datainputs=job_id=d7111976-3a9f-401c-a3d2-c1a5c30329ac

If the requested job is finished and the completion time is within the retention time period, the URL of the GeoTIFF RGB cloud free mosaic is returned. The URL is used to download the mosaic via standard HTTP.

Cloud free mosaic download XML sample response
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="https://borealweb.nfis.org/tb15d100wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="">
    <wps:Process wps:processVersion="1.0.0">
        <ows:Identifier>get_result</ows:Identifier>
        <ows:Title>Get Result</ows:Title>
        <ows:Abstract>Retrieve the Job Result</ows:Abstract>
    </wps:Process>
    <wps:Status creationTime="2019-08-21T15:46:53Z">
        <wps:ProcessSucceeded>PyWPS Process Get Result finished</wps:ProcessSucceeded>
    </wps:Status>
    <wps:ProcessOutputs>
        <wps:Output>
            <ows:Identifier>status</ows:Identifier>
            <ows:Title>Job Status</ows:Title>
            <ows:Abstract></ows:Abstract>
            <wps:Data>
                <wps:LiteralData uom="urn:ogc:def:uom:OGC:1.0:unity" dataType="string">https://borealweb.nfis.org/tb15d100wps/d7111976-3a9f-401c-a3d2-c1a5c30329ac</wps:LiteralData>
            </wps:Data>
        </wps:Output>
    </wps:ProcessOutputs>
</wps:ExecuteResponse>

8.3.5. Orchestrator

This software component is the core of the ML system and is in charge of searching and downloading product bands, loading and triggering the model, and creating the final cloud free mosaic in GeoTIFF RGB format. In order to optimize the execution performance of the Orchestrator (considering also that downloading of product bands is time consuming), the Orchestrator was designed with the following requirements:

  • Only one product is handled at a time

  • Product bands are retrieved only if the area covered by the current product still contains some clouds in the temporary mosaic; otherwise it skips to the next product

  • The mosaic generation ends as soon as the Petawawa area is entirely cloud free, or imagery products are no longer available.
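
The three requirements above can be sketched as a simple loop. This is an illustration under simplifying assumptions, not the Orchestrator's actual code. The area is modeled as a set of grid cells. Each hypothetical product dictionary carries a `footprint` and a `clear` set (the cells its cloud mask marks as cloud free). In the real system the `clear` set would only be known after the bands are downloaded and the model has run.

```python
def build_mosaic(products, area_cells):
    """Fill the area one product at a time until no cloudy cell remains.

    products: iterable of dicts with keys 'id', 'footprint' and 'clear'
              (hypothetical structure used only for this sketch).
    Returns (mosaic, downloaded, remaining) where mosaic maps each filled
    cell to the id of the product that supplied it.
    """
    remaining = set(area_cells)   # cells still cloudy in the temporary mosaic
    mosaic = {}
    downloaded = []
    for product in products:      # only one product is handled at a time
        if not remaining:
            break                 # area entirely cloud free: stop early
        if not (product["footprint"] & remaining):
            continue              # nothing left to improve here: skip download
        downloaded.append(product["id"])  # stands in for download + model run
        for cell in product["clear"] & remaining:
            mosaic[cell] = product["id"]
        remaining -= product["clear"]
    return mosaic, downloaded, remaining


area = {(0, 0), (0, 1), (1, 0), (1, 1)}
products = [
    {"id": "A", "footprint": area, "clear": {(0, 0), (0, 1)}},
    {"id": "B", "footprint": {(0, 0), (0, 1)}, "clear": {(0, 0), (0, 1)}},
    {"id": "C", "footprint": area, "clear": {(0, 0), (1, 0), (1, 1)}},
    {"id": "D", "footprint": area, "clear": area},
]
mosaic, downloaded, remaining = build_mosaic(products, area)
```

In this example product B is skipped because the cells it covers are already cloud free, and product D is never reached because the area is complete after product C.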

The NRCan forestry CSW service endpoint is located at https://saforah2.nfis.org/geonetwork-main/srv/eng/csw.

Note
Filtering and metadata result assumption

The GetRecords request is sent with prf as a specific filter so that only products covering the Petawawa Research Forest are retrieved. The csw:GetRecordsResponse does not contain a dedicated field for cloud coverage, but this information is available in the response as a free-text property (refer to the following snippet). The Orchestrator assumes that this value is always present, so that products with a cloud coverage greater than 70% can be skipped.

Following a GetRecords request, the CSW server returns all matching products that are tagged with the prf string and were acquired within the range of the start / stop parameters provided in the job request.

csw:GetRecordsResponse snippet for Cloud Coverage
<gmd:abstract xsi:type="gmd:PT_FreeText_PropertyType">
    <gco:CharacterString>Sentinel-2 surface reflectances images (L2A) in .SAFE format. The surface reflectance products were generated by applying the Sen2Cor algorithm to the Top of Atmosphere (L1C) Sentinel-2 images provided by the European Space Agency. For more information on the Sen2Cor algorithm please visit http://step.esa.int/main/third-party-plugins-2/sen2cor/.

        Sensor: MSI
        Platform: Sentinel2A
        Acquisition Date: 2017-07-18
        Provider: European Space Agency
        Cell Size (m): 20
        Cloud Cover (%): 7.9388
    </gco:CharacterString>
    ...
    ...
    ...
</gmd:abstract>
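
Because the cloud coverage is only available as free text, it has to be parsed out of the abstract. The sketch below shows one possible way to do so with a regular expression and to apply the 70% threshold. The function names are illustrative. Rejecting a product when the value is missing is an assumption of this sketch, since the report assumes the value is always present.

```python
import re

# Matches the "Cloud Cover (%): 7.9388" line of the abstract shown above.
CLOUD_COVER_RE = re.compile(r"Cloud Cover \(%\):\s*([0-9]+(?:\.[0-9]+)?)")


def parse_cloud_cover(abstract_text):
    """Return the cloud cover percentage as a float, or None if absent."""
    match = CLOUD_COVER_RE.search(abstract_text)
    return float(match.group(1)) if match else None


def is_acceptable(abstract_text, cloud_threshold=70.0):
    """Accept a product unless its cloud cover exceeds the threshold.

    A missing value leads to rejection (assumption made for this sketch).
    """
    cover = parse_cloud_cover(abstract_text)
    return cover is not None and cover <= cloud_threshold


abstract = """Sensor: MSI
Platform: Sentinel2A
Acquisition Date: 2017-07-18
Cell Size (m): 20
Cloud Cover (%): 7.9388
"""
cover = parse_cloud_cover(abstract)
accepted = is_acceptable(abstract)
```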

The main configuration parameters for the Orchestrator are stored inside the config.yaml file:

config.yaml sample file
# Logging level
logging_level: INFO

# Workspace and products download path
workspace_path: /data/ogctb15/workspace
download_path: downloads

#
# Mosaicing
#

# CSW endpoint
csw_endpoint: "https://saforah2.nfis.org/geonetwork-main/srv/eng/csw"

# Cloud coverage percentage threshold to accept image
cloud_threshold: 70

#
# Neural Network
#

# Check point paths
cnn_checkpoint_path: "/data/ogctb15/checkpoints"
cnn_checkpoint_RGB: "ModelV2-CloudDetectionNetV2_RGB_epoch-100.20190727.pth"
cnn_checkpoint_RGBNIR: "ModelV2-CloudDetectionNetV2_RGBNIR_epoch-100.20190727.pth"

# Tile size
tile_width: 224
tile_height: 224

# Petawawa shape file
petawawa_shp: '/data/ogctb15/shapefiles/prf/petawawa_research_forest.shp'

# Mosaic retention time (minutes)
retention_time: 30
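
The retention_time value determines how long a finished mosaic remains retrievable. A minimal sketch of such a check follows. The function name `result_available` and the inclusive boundary are assumptions made for illustration, and the report does not describe how the check is implemented.

```python
from datetime import datetime, timedelta, timezone


def result_available(completed_at, now, retention_minutes=30):
    """Return True while a finished job is still inside its retention period.

    retention_minutes mirrors the retention_time entry of config.yaml.
    Treating the exact boundary as still available is an assumption.
    """
    age = now - completed_at
    return timedelta(0) <= age <= timedelta(minutes=retention_minutes)


completed = datetime(2019, 8, 21, 15, 46, 53, tzinfo=timezone.utc)
still_there = result_available(completed, completed + timedelta(minutes=29))
expired = result_available(completed, completed + timedelta(minutes=31))
```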

First, the Orchestrator checks whether the area of the current product still contains clouds. If it does, the bands are downloaded via a WCS endpoint (one WCS request for each band) into the dedicated job folder in the Workspace. The number of bands retrieved is either 3 or 4, depending on the requested ML model.
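
The link between the requested model, the number of bands and the checkpoint entries of config.yaml can be sketched as follows. The model keys RGB and RGBNIR are taken from the checkpoint names in the configuration. The band labels and the function `plan_job` are hypothetical, since the band identifiers used in the WCS requests are not given here.

```python
import posixpath

# Subset of the config.yaml sample shown earlier in this section.
CONFIG = {
    "cnn_checkpoint_path": "/data/ogctb15/checkpoints",
    "cnn_checkpoint_RGB": "ModelV2-CloudDetectionNetV2_RGB_epoch-100.20190727.pth",
    "cnn_checkpoint_RGBNIR": "ModelV2-CloudDetectionNetV2_RGBNIR_epoch-100.20190727.pth",
}

# Illustrative band labels only (assumption): 3 bands for RGB, 4 for RGBNIR.
BANDS = {
    "RGB": ("red", "green", "blue"),
    "RGBNIR": ("red", "green", "blue", "nir"),
}


def plan_job(model, config=CONFIG):
    """Return (bands to download, checkpoint path) for the requested model."""
    if model not in BANDS:
        raise ValueError("unknown model: %r" % (model,))
    checkpoint = posixpath.join(
        config["cnn_checkpoint_path"], config["cnn_checkpoint_" + model]
    )
    return BANDS[model], checkpoint


bands, checkpoint = plan_job("RGBNIR")
```

With this layout, one WCS request would be issued per entry in `bands`.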

Note
WCS Coordinate Reference Systems

For all GetCoverage requests, the BoundingBox coordinates use either EPSG:26917 or EPSG:26918 as the CRS, depending on the product.

Once all bands are locally available, they are cut into 224 x 224 pixel tiles via the Bands Slicer to be used as input to the ML model. The requested ML model is loaded and run tile by tile (each tile composed of the different bands). The output is an equivalently sized black & white image showing cloud presence (in white).
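
The slicing step can be illustrated by computing the pixel windows of each tile. This is a sketch only, and the Bands Slicer itself is not reproduced here. The function name `tile_windows` is hypothetical, and clipping the tiles at the right and bottom edges is an assumption, since the report does not state how partial tiles are handled.

```python
def tile_windows(width, height, tile_width=224, tile_height=224):
    """Yield (x_offset, y_offset, w, h) windows covering a band.

    Defaults mirror tile_width and tile_height from config.yaml.
    Edge tiles are clipped to the band extent (assumption).
    """
    for y in range(0, height, tile_height):
        for x in range(0, width, tile_width):
            yield (
                x,
                y,
                min(tile_width, width - x),
                min(tile_height, height - y),
            )


# A band of 1120 x 672 pixels splits into 5 x 3 full tiles.
windows = list(tile_windows(1120, 672))
```

The same windows would be applied to every band of a product so that each tile passed to the model stacks the matching pixels from all bands.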

Table 6. Sample cloud masks generated for the RGB bands for two different tiles
Red Green Blue Cloud Mask
6a Components R.tile.896.448
6a Components G.tile.896.448
6a Components B.tile.896.448
6a Components M.tile.896.448