I. Executive Summary
The Testbed 18 task Identifiers for Reproducible Science explored and developed Earth Observation (EO) workflows demonstrating best practices from FAIR data and reproducible science, while exploring the usability of Whole Tale as a component for reproducible workflows. EO workflows involve multiple processing steps, such as imagery pre-processing, training of AI/ML models, the fusion or mosaicking of imagery, and other analytical processes. For EO workflows to be reproducible, each step of the process must be documented in sufficient detail and that documentation must be made available to scientists and end users.
The following five participating organizations contributed workflows for the Testbed 18 Reproducible Science task.
52 Degrees North developed a Whole Tale workflow for land cover classification.
Arizona State University developed a reproducible workflow for a deep learning application for target detection from Earth Observation imagery.
Ecere worked on the implementation of reproducible workflows following the approach described in OGC API — Processes — Part 3: Workflows and Chaining for modular OGC API workflows.
GeoLabs developed a reproducible workflow that runs an OGC API — Processes and Features server instance within a Whole Tale environment.
Terradue developed a water body detection Application Package covering identifier assignment and reproducibility from code through several execution scenarios (local, Exploitation Platform, Whole Tale) and is the editor of the Reproducible Best Practices ER, another component of the Reproducible Science stream.
Over the course of the Reproducible Science task multiple considerations and limitations for reproducible workflows were discovered including the following.
The expansion of FAIR to include replicability, repeatability, reproducibility, and reusability (reproducible-FAIR).
Replicability: A process with the same input yields the same output.
Repeatability: A process with a similar input yields the same output.
Reproducibility: Different inputs, platforms, and outputs result in the same conclusion.
Reusability: The ability to use a specific workflow for different areas with the same degree of accuracy and reliability on the output.
Addressing randomness in deep learning applications.
Addressing the limitation that Whole Tale cannot assign a DOI to a binary Docker image used to build a Whole Tale experiment.
Recommended future work includes exploring the impact of FAIR workflows on healthcare use cases, which would make data more available and reliable to researchers, healthcare practitioners, emergency response personnel, and decision makers.
II. Keywords
The following are keywords to be used by search engines and document catalogues.
testbed, docker, web service, reproducibility, earth observation, workflows, whole tale, deep learning, fair
III. Security considerations
No security considerations have been made for this document.
IV. Submitters
All questions regarding this document should be directed to the editor or the contributors:
Table — Submitters
Name | Organization | Role |
---|---|---|
Paul Churchyard | HSR.health | Editor |
Ajay K. Gupta | HSR.health | Editor |
Martin Pontius | 52 North | Contributor |
Chia-Yu Hsu | Arizona State University | Contributor |
Jerome Jacovella-St-Louis | Ecere | Contributor |
Patrick Dion | Ecere | Contributor |
Gérald Fenoy | GeoLabs | Contributor |
Fabrice Brito | Terradue | Contributor |
Pedro Goncalves | Terradue | Contributor |
Josh Lieberman | OGC | Contributor |
V. Abstract
The OGC’s Testbed 18 initiative explored the following six tasks.
1.) Advanced Interoperability for Building Energy
2.) Secure Asynchronous Catalogs
3.) Identifiers for Reproducible Science
4.) Moving Features and Sensor Integration
5.) 3D+ Data Standards and Streaming
6.) Machine Learning Training Data
Testbed 18 Task 3, Identifiers for Reproducible Science, explored and developed workflows demonstrating best practices at the intersection of Findable, Accessible, Interoperable, and Reusable (or FAIR) data and reproducible science.
The workflows developed in this Testbed included:
the development of a Whole Tale workflow for land cover classification (52 Degrees North);
the development of a reproducible workflow for a deep learning application for target detection (Arizona State University);
the implementation of reproducible workflows following the approach described in OGC API — Processes — Part 3: Workflows and Chaining for modular OGC API workflows (Ecere);
the development of a reproducible workflow that runs an OGC API — Process and Feature Server instance within a Whole Tale environment (GeoLabs); and
the development of a water body detection Application Package to cover the identifier assignment and reproducibility from code to several execution scenarios (local, Exploitation Platform, Whole Tale) (Terradue).
Testbed 18 participants identified considerations and limitations for reproducible workflows, as well as recommendations for future work exploring the benefits of reproducible science for healthcare use cases.
Testbed-18: Identifiers for Reproducible Science Summary Engineering Report
1. Scope
This report is a summary of activities undertaken in the execution of the Testbed 18 Identifiers for Reproducible Science Stream. This included the development of best practices to describe all steps of an Earth Observation scientific workflow, including:
input data from various sources such as files, APIs, and data cubes;
the workflow itself with the involved application(s) and corresponding parameterizations; and
output data.
The participants were also tasked with producing reproducible workflows and examining the feasibility of Whole Tale as a tool for reproducible workflows.
2. Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
Open API Initiative: OpenAPI Specification 3.0.2, 2018 https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.2.md
van den Brink, L., Portele, C., Vretanos, P.: OGC 10-100r3, Geography Markup Language (GML) Simple Features Profile, 2012 http://portal.opengeospatial.org/files/?artifact_id=42729
W3C: HTML5, W3C Recommendation, 2019 http://www.w3.org/TR/html5/
Schema.org: http://schema.org/docs/schemas.html
R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee: IETF RFC 2616, Hypertext Transfer Protocol — HTTP/1.1. RFC Publisher (1999). https://www.rfc-editor.org/info/rfc2616.
E. Rescorla: IETF RFC 2818, HTTP Over TLS. RFC Publisher (2000). https://www.rfc-editor.org/info/rfc2818.
G. Klyne, C. Newman: IETF RFC 3339, Date and Time on the Internet: Timestamps. RFC Publisher (2002). https://www.rfc-editor.org/info/rfc3339.
M. Nottingham: IETF RFC 8288, Web Linking. RFC Publisher (2017). https://www.rfc-editor.org/info/rfc8288.
H. Butler, M. Daly, A. Doyle, S. Gillies, S. Hagen, T. Schaub: IETF RFC 7946, The GeoJSON Format. RFC Publisher (2016). https://www.rfc-editor.org/info/rfc7946.
3. Terms, definitions and abbreviated terms
This document uses the terms defined in OGC Policy Directive 49, which is based on the ISO/IEC Directives, Part 2, Rules for the structure and drafting of International Standards. In particular, the word “shall” (not “must”) is the verb form used to indicate a requirement to be strictly followed to conform to this document and OGC documents do not use the equivalent phrases in the ISO/IEC Directives, Part 2.
This document also uses terms defined in the OGC Standard for Modular specifications (OGC 08-131r3), also known as the ‘ModSpec’. The definitions of terms such as standard, specification, requirement, and conformance test are provided in the ModSpec.
For the purposes of this document, the following additional terms and definitions apply.
3.1. API Coverages
“A Web API for accessing coverages that are modeled according to the Coverage Implementation Schema (CIS) 1.1. Coverages are represented by some binary or ASCII serialization, specified by some data (encoding) format.” Open Geospatial Consortium
3.2. API Features
“A multi-part standard that offers the capability to create, modify, and query spatial data on the Web and specifies requirements and recommendations for APIs that want to follow a standard way of sharing feature data.” Open Geospatial Consortium
3.3. GeoTIFF
“A GeoTIFF file extension contains geographic metadata that describes the actual location in space that each pixel in an image represents.” Heavy.ai
3.4. Copernicus CORINE Land Cover dataset
A collection of land cover maps with 44 land cover classes, produced since 1985. Copernicus
3.5. Data Cube
“A multi-dimensional (“n-D”) array of values, with emphasis on the fact that “cube” is just a metaphor to help illustrate a data structure that can in fact be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional.” Open Geospatial Consortium
3.6. DVC
“Open-Source Version Control System for Machine Learning Projects” DVC
3.7. GDAL
“A translator library for raster and vector geospatial data formats that is released under an MIT style Open Source License by the Open Source Geospatial Foundation.” GDAL
3.8. Non-Deterministic Models
An algorithm that can exhibit different behaviors on different runs, which is useful for finding approximate solutions when an exact solution is far too difficult or expensive to derive using a deterministic algorithm. Engati
3.9. Parameterization Tuning / Tuning Parameters / Hyperparameters
Parameters that are a component of machine learning models that cannot be directly estimated from the data, but often control the complexity and variances in the model. Kuhn M. and Johnson K.
3.10. PyTorch
“An open source machine learning framework that accelerates the path from research prototyping to production deployment.” PyTorch
3.11. Whole Tale
“A scalable, open source, web-based, multi-user platform for reproducible research enabling the creation, publication, and execution of tales — executable research objects that capture data, code, and the complete software environment used to produce research findings.” Whole Tale
3.12. Landsat
The NASA/USGS Landsat Program provides the longest continuous space-based record of Earth’s land in existence. Landsat data give us information essential for making informed decisions about Earth’s resources and environment. National Aeronautics and Space Administration
3.13. Sentinel-2
A European wide-swath, high-resolution, multi-spectral imaging mission. European Space Agency
3.14. Software Heritage
Software Heritage is an organization that allows for the preservation, archiving, and sharing of source code for software. Software Heritage
3.15. ISEA3H DGGS
Icosahedral Snyder Equal Area Aperture 3 Hexagon Discrete Global Grid System — a specification for an equal area DGGS based on the Icosahedral Snyder Equal Area (ISEA) projection. Southern Oregon University
3.16. RDF Encoding
A metadata format that “provides interoperability between applications that exchange machine-understandable information on the Web.” W3C
3.17. Zenodo
A platform and repository developed and operated by CERN to enable the sharing of research data and outputs. Zenodo
3.18. CodeMeta
“CodeMeta contributors are creating a minimal metadata schema for science software and code, in JSON and XML. The goal of CodeMeta is to create a concept vocabulary that can be used to standardize the exchange of software metadata across repositories and organizations.” CodeMeta
3.19. Abbreviated terms
ADES
Application Deployment and Execution Service
API
Application Programming Interface
AWS
Amazon Web Services
COG
Cloud Optimized GeoTIFF
CWL
Common Workflow Language
DGGS
Discrete Global Grid System
EMS
Execution Management Service
ESA
European Space Agency
FAIR
Findable, Accessible, Interoperable, Reusable
IANA
Internet Assigned Numbers Authority
MIME
Multipurpose Internet Mail Extensions
MODIS
Moderate Resolution Imaging Spectroradiometer
NASA
National Aeronautics and Space Administration
NSF
National Science Foundation
ODC
Open Data Cube
SCM
Software Configuration Management
SPDX
Software Package Data Exchange
STAC
SpatioTemporal Asset Catalog
SWG
Standards Working Group
USGS
United States Geological Survey
WG
Working Group
4. Introduction
The OGC’s Testbed 18 initiative explored six tasks, including: Advanced Interoperability for Building Energy; Secure Asynchronous Catalogs; Identifiers for Reproducible Science; Moving Features and Sensor Integration; 3D+ Data Standards and Streaming; and Machine Learning Training Data.
This component of OGC’s Testbed 18 focuses on Identifiers for Reproducible Science, part of Testbed Thread 3: Future of Open Science and Building Energy Interoperability (FOB).
Issues around sound science are the topic of numerous academic and technical journal articles. A common theme among these scholarly articles is that the reproducibility of studies is a key aspect of science.
OGC’s mission is to make location information Findable, Accessible, Interoperable, and Reusable (FAIR). This Testbed task will explore and develop workflows demonstrating best practices at the intersection of FAIR data and reproducible science.
This task shall develop best practices to describe all steps of a scientific workflow, including:
data curation from various authoritative sources such as files, APIs, and data cubes;
the workflows themselves supporting multiple applications and corresponding parameterization tuning for machine learning processes; and
workflow outputs or results that support decision-making.
The workflows included in this component represent key areas of scientific discovery leveraging location information, Earth Observation data, and geospatial processes.
The description of the models will discuss how each step of the workflow can abide by FAIR principles.
Testbed 18 builds on the OGC’s past work as well as broader industry efforts. One of the tasks of the Testbed was to explore the utilization of Whole Tale as a tool for reproducible workflows and, ideally, to work collaboratively with the Whole Tale team to identify and address limitations in the use of Whole Tale for reproducible workflows. Some of the work in the individual workflows builds on the participants’ efforts in previous Testbeds.
5. Common Considerations and Limitations for Reproducible Workflows
Over the course of this Testbed a number of considerations and limitations were discovered pertaining to the workflows of the participants. The common considerations and limitations identified across multiple workflows are discussed in this section. A few considerations that pertained only to specific workflows are discussed in the individual component sections that follow.
5.1. The Expansion of FAIR
The participants proposed expanding the FAIR principles to explicitly address four related concepts: replicability, repeatability, reproducibility, and reusability.
Replicability: A process with the same input yields the same output.
An analysis of soil characteristics of a soil sample produces the same result each time the process is run on the same sample.
Repeatability: A process with a similar input yields the same output.
An analysis of soil characteristics performed on an identical but different soil sample produces the same result.
Reproducibility: Different inputs, platforms, and outputs result in the same conclusion.
The classification workflow for a particular city based on an analysis of Landsat data should have a similar result when performed with aerial or other satellite imagery taken at the same time for the same location. Additionally, the workflow and results should be the same when run on different Cloud-based computing environments (e.g., Google Cloud Platform (GCP), Microsoft Azure, or Amazon Web Services).
Reusability: The ability to use a specific workflow for different areas with the same degree of accuracy and reliability on the output.
An image classification workflow created for a specific city could also be used to classify a different city within a level of accuracy or confidence interval.
5.2. Randomness in Deep Learning Applications
Randomness is important for deep learning models and applications for optimization and generalization. However, the randomness inherent in the models makes true reproducibility a challenge.
5.2.1. Where Does Randomness Appear in Deep Learning Models?
Hardware
Environment
Software/framework: different version releases, individual commits, operating systems, etc.
Function: how the models are written as code and the package dependencies based on different programming languages
Algorithm: random initialization, data augmentation, data shuffling, and stochastic layers (dropout, noisy activations)
5.2.2. Why is Randomness Critical and Important to Deep Learning Applications?
Different outputs from the same input: normally the expected result is the same output given the same input (e.g., classification or detection), but sometimes some “creativity” is necessary in the model. Examples include the first move in a game of Go or drawing a picture on a blank canvas.
Learning/optimization: a neural network loss function has many local minima and it is easy to get stuck in a local minimum during the optimization process. Randomness in many algorithms allows the optimization to bounce out of local minima, for example through random sampling in stochastic gradient descent (SGD).
Generalization: randomness brings better generalization to the network by injecting noise into the learning process, for example through Dropout.
5.2.3. How to Limit the Source of Randomness and Non-Deterministic Behaviors?
Using PyTorch as an example, the sources of randomness can be limited as follows (a combined sketch is shown after this list).
Control sources of randomness
torch.manual_seed(0): control the random seed of PyTorch operations; and
random.seed(0): control the random seed of customized Python operations.
Avoiding using non-deterministic algorithms
torch.backends.cudnn.benchmark = False: for a new set of hyperparameters, the cuDNN library runs different algorithms, benchmarks them, and selects the fastest one. Disabling this feature can reduce performance but removes a source of non-determinism.
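The individual settings above can be combined into a small helper that is called once at the start of a training or inference script. The following sketch is illustrative only and is not taken from the Testbed code; the helper name set_reproducible and the additional NumPy seeding are assumptions.
import random

import numpy as np
import torch


def set_reproducible(seed: int = 0) -> None:
    """Limit (but not eliminate) sources of randomness in a PyTorch run."""
    random.seed(seed)        # customized Python operations
    np.random.seed(seed)     # NumPy-based preprocessing / augmentation
    torch.manual_seed(seed)  # PyTorch random number generators (CPU and, in recent versions, CUDA)

    # Avoid cuDNN auto-tuning, which may select a different (fastest) algorithm on each run.
    torch.backends.cudnn.benchmark = False
    # Prefer deterministic cuDNN kernels where available (may reduce performance).
    torch.backends.cudnn.deterministic = True


set_reproducible(0)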
5.2.4. What is the Trade-Off Between Randomness and Reproducibility?
Randomness plays an important role in deep learning, so identical results are not guaranteed and should not be expected. As such, obtaining the same or similar results, as required for reproducibility, may be an important but not a critical issue for deep learning models. Transparency is therefore important for the reproducibility of deep learning models so that every step of the process can be examined.
5.3. Cloud Utilization Cost Estimation
There is not always clarity in Cloud utilization costs. Any Cloud-based development effort should make a concerted effort to track fees associated with the Cloud infrastructure. Most, if not all, Cloud hosting firms make tools available to help anticipate costs. For instance, AWS provides a Pricing Calculator that can help provide insight into future Cloud hosting and utilization fees.
6. Individual Component Descriptions
6.1. 52°North
6.1.1. Goals of Participation
The goals of participation included the selection and development of a viable workflow on Whole Tale that provides the following features.
End-users will be able to experience how Spatial Data Analysis can be published in a reproducible FAIR manner.
The implementation of this task will help OGC Working Groups derive requirements and limitations of existing standards for the enablement of reproducible FAIR workflows.
Developers will be able to follow a proof of concept for setting up a reproducible FAIR workflow based on OGC standards and an Open Data Cube instance.
6.1.2. Contributed Workflows and Architecture
52°North brought in a selection of use-cases from their own and partner research activities. From these potential workflows the use-case “Exploring Wilderness Using Explainable Machine Learning in Satellite Imagery” was chosen. This scientific study was conducted as part of the “KI:STE — Artificial Intelligence (AI) strategy for Earth system data” project and was already published on arXiv (https://arxiv.org/abs/2203.00379). The goal of the study was the detection of wilderness areas using remote sensing data from Sentinel-2. Moreover, the developed machine learning models allow the interpretation of the results by applying explainable machine learning techniques. The study area is Fennoscandia (https://en.wikipedia.org/wiki/Fennoscandia). For this region the AnthroProtect dataset was prepared and openly released (http://rs.ipb.uni-bonn.de/data/anthroprotect/). This dataset consists of preprocessed Sentinel-2 data. The regions of interest were determined using data from the Copernicus CORINE Land Cover dataset and from the World Database on Protected Areas (WDPA). Additionally, land cover data from five different sources are part of the AnthroProtect dataset: Copernicus CORINE Land Cover dataset, MODIS Land Cover Type 1, Copernicus Global Land Service, ESA GlobCover, and Sentinel-2 scene classification map.
In order to make the data available inside Whole Tale and to investigate reproducibility aspects of OGC APIs in conjunction with Open Data Cube, the AnthroProtect dataset was imported and indexed in an Open Data Cube instance and published via API Coverages and STAC. It would have been beneficial to offer some sub-processes of the workflow via API Processes, but this was not possible within these Testbed activities.
The source code of the original study is available at https://gitlab.jsc.fz-juelich.de/kiste/wilderness. A slightly modified version is available at https://gitlab.jsc.fz-juelich.de/kiste/asos and was used as a starting point for Testbed 18. Based on these developments, a separate GitHub repository was created (https://github.com/52North/testbed18-wilderness-workflow) which includes parts of the original workflow. From this repository, a tale on Whole Tale (https://dashboard.wholetale.org/run/633d5fb4eb89f198ef8ce83f) was created which can be executed by interested users to reproduce parts of the study.
6.1.3. Workflow Description
Figure 1 shows an overview of the workflow steps with their inputs and outputs performed in the original study.
The first step of the workflow, the preparation of the AnthroProtect dataset, was performed with Google Earth Engine (GEE). The download and preprocessing are described in detail in the research article and can be executed using Jupyter notebooks in a sub-project of the source repository (https://gitlab.jsc.fz-juelich.de/kiste/asos/-/tree/main/projects/anthroprotect). Due to time constraints this step was excluded from the Testbed activities. The prepared dataset can be downloaded as a zip file and is regarded as the “source of truth” from which reproducibility is enhanced. Some thoughts on reproducibility regarding closed-source APIs/services like GEE are described in 52°North’s future work section.
Steps 2 and 3 include the training of the ML model and a sensitivity analysis (Activation Space Occlusion Sensitivity (ASOS)) which helps to interpret model results. For details, we refer to the research article mentioned in the previous paragraph. As these steps are processing intensive, with processing times that are not well-suited for demonstration purposes, they are not part of the Whole Tale tale. However, it would be interesting to set up OGC API Processes for these steps. Further reproducibility aspects of machine learning could be studied, e.g., how the choice of training data or hyperparameter tuning influences model weights, or how ML models can be versioned, and it could be demonstrated how the OGC Processing API can contribute to reproducibility.
The trained model and the calculated sensitivities can be used to analyze Sentinel-2 scenes to predict activation and sensitivity maps. The sensitivity maps show the classification of a scene as wild or anthropogenic. Workflow step 5.1 allows the performance of such an analysis with available Sentinel-2 samples. It is the core of the developed Whole Tale tale in these Testbed activities. Workflow Step 4 allows for the inspection of the activation space in detail and for the investigation of how areas in the activation space relate to land cover classes.
Figure 1 — 52°North Workflow Diagram
6.1.4. Local Execution
The original study is already designed and published in a way that reproducibility is enhanced. It is published on arXiv and has a DOI assigned: https://doi.org/10.48550/arXiv.2203.00379. The publication includes a research article as a pdf and links to associated datasets and the source code repository. Users can download these resources and repeat the analysis on their system of choice, such as their local machine.
The first experiment of the Testbed activities was to download the resources and perform the analysis on a local computer. While it was possible to do so, some challenges were observed. Mainly, users need to set up the runtime environment on their own. The setup of the runtime environment involves installing Python and system libraries, which can be challenging for non-technical users. As the workflow needs GDAL, the system library has to be installed in addition to the Python GDAL library with a matching version. Another challenge of setting up the runtime environment is the choice of the processing device, CPU or GPU. By default GPUs are used in many ML applications, and in the original code the device could not be configured differently; this problem was fixed during the Testbed. Lastly, the chosen system needs to provide sufficient memory and disk space to perform the analysis. In the given example, the technical requirements are quite moderate (~50 GB disk space and ~10 GB RAM needed); however, in many other use cases the hardware requirements could become a practical limitation.
The challenge of setting up the runtime environment can be addressed by using Docker and running the application inside a Docker container. To address problems with hardware requirements and to eliminate the necessity of downloading or moving large amounts of data, parts of a workflow can be encapsulated in an OGC API Process which runs the workflow close to where the data resides. In this case it is important to ensure that processes and process inputs and outputs are also reproducible. Experiments with running parts of the workflow on the Whole Tale platform are described in the upcoming sections.
6.1.5. Supporting Infrastructure
As mentioned above, the AnthroProtect dataset should be made available via OGC APIs to be used inside of Whole Tale without uploading the whole dataset within Testbed 18. Moreover, reproducibility aspects of data cube infrastructures should be investigated. Therefore, an Open Data Cube instance is set up and the AnthroProtect dataset is added to the Open Data Cube index. This is done in a dedicated Docker container. The Docker image is publicly available via https://hub.docker.com/r/52north/opendatacube-importer as well as the source code via https://github.com/52North/opendatacube-importer. When the container is run it will download the AnthroProtect dataset from the public repository as ZIP file (https://uni-bonn.sciebo.de/s/6wrgdIndjpfRJuA), check the file hash (sha256), unpack the data, create ODC metadata files (yaml format), and add the files to the ODC index. Data that is indexed this way can be discovered and loaded using ODC’s Python API. Afterwards, a pygeoapi instance is deployed which runs in a separate Docker container (https://hub.docker.com/r/52north/pygeoapi-opendatacube). The image extends the default pygeoapi Docker image (https://hub.docker.com/r/geopython/pygeoapi) with the pygeoapi-odc-provider (https://github.com/52North/pygeoapi-odc-provider) in order to serve data from Open Data Cube via OGC APIs (API Coverages). Both Docker images are built automatically in a pipeline using GitHub Actions when a new GitHub tag is created. After successfully being built, the Docker images are pushed to the respective image repository on Docker Hub. The images are referenced in a Kubernetes setup using Kustomize when defining which containers should be run.
- name: pygeoapi
image: 52north/pygeoapi-opendatacube:0.4.0
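The file hash check performed by the importer container can be illustrated in a few lines of Python. The following sketch is illustrative only; the URL and the expected hash are placeholders rather than the values used by the 52°North importer.
import hashlib
import urllib.request

# Placeholders: substitute the real archive URL and its published sha256 value.
DATASET_URL = "https://example.org/anthroprotect.zip"
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

urllib.request.urlretrieve(DATASET_URL, "anthroprotect.zip")

sha256 = hashlib.sha256()
with open("anthroprotect.zip", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

if sha256.hexdigest() != EXPECTED_SHA256:
    raise RuntimeError("Checksum mismatch: the downloaded dataset differs from the expected version.")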
As described in the considerations and limitations section of GeoLabs’ component, tags on Docker Hub are not completely reliable as it is possible to publish a different image with the same tag. While the steps explained above include some aspects of reproducibility in terms of setting up an infrastructure, there is also a need for reproducibility when discovering and using data and associated services. Over the past few years, the STAC specification has evolved into a de facto standard with the goal of describing spatio-temporal data for easy discovery and access. The Open Data Cube community has already taken steps to extend ODC to enable users to index datasets using STAC.
QUESTION:
Is it possible to use the ODC metadata interface (run in PostgreSQL) to index existing analysis-ready data sets? Suppose one has a lot of data already stored in Amazon AWS which is analysis-ready. How is this data ingested in ODC?
ANSWER:
Not only is this possible, but this is the strategic vision for DEA’s delivery architecture using ODC. Datasets from S3 can be indexed individually — however we are currently working with Radiant Earth to both publish our datasets in AWS S3 with STAC metadata and to extend ODC to be able to index datasets using STAC. Using STAC’s principles of static, lightweight representation of spatial metadata would mean large collections could be indexed quickly, with a minimum of GET requests.
— https://www.opendatacube.org/faq
There are also tools and explanations to jointly work with ODC and STAC (e.g., https://github.com/opendatacube/odc-stac or https://docs.digitalearthafrica.org/en/latest/sandbox/notebooks/Frequently_used_code/Downloading_data_with_STAC.html).
In the following, exemplary STAC items are presented for the AnthroProtect dataset that is indexed in an Open Data Cube instance. A specific focus is put on three STAC extensions: the Datacube extension (https://github.com/stac-extensions/datacube), the Scientific Citation extension (https://github.com/stac-extensions/scientific), and the File Info extension (https://github.com/stac-extensions/file). It has to be noted that they have different levels of maturity, with two of them only being proposals at the time of writing. STAC consists of four semi-independent specifications: STAC Item, STAC Catalog, STAC Collection, and STAC API (https://stacspec.org/en/). Items describe specific assets (files); catalogs and collections are ways of organizing items, where collections require some more information compared to catalogs. The STAC API defines a REST API which provides the ability to search STAC Items. As the top-level STAC Catalog, a catalog for Open Data Cube is defined as follows:
{
"stac_version": "1.0.0",
"stac_extensions": [],
"type": "Catalog",
"id": "opendatacube",
"title": "Open Data Cube, Testbed-18",
"description": "Open Data Cube instance for Testbed-18 - Identifiers for Reproducible Science",
"links": [
{
"rel": "root",
"href": "./catalog.json",
"type": "application/json"
},
{
"rel": "child",
"href": "./anthroprotect/collection.json",
"type": "application/json"
}
]
}
This catalog has only one child, the STAC Collection for the AnthroProtect dataset:
{
"stac_version": "1.0.0",
"stac_extensions": [
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
],
"type": "Collection",
"id": "anthroprotect",
"title": "AnthroProtect dataset",
"description": "AnthroProtect dataset consisting of Sentinel-2 and land cover data",
"license": "CC-BY-NC-SA-3.0",
"sci:citation": "Stomberg, T. T., Stone, T., Leonhardt, J., Weber, I., & Roscher, R. (2022). Exploring Wilderness Using Explainable Machine Learning in Satellite Imagery. arXiv preprint arXiv:2203.00379.",
"sci:doi": "10.48550/arXiv.2203.00379",
"extent": {
"spatial": {
"bbox": [
[
5.20467414651499,
55.353564035013825,
30.630830204026555,
70.36171326846416
]
]
},
"temporal": {
"interval": [
[
"2020-07-01T00:00:00.000Z",
"2020-08-30T23:59:59.999Z"
]
]
}
},
"links": [
{
"rel": "root",
"href": "../catalog.json",
"type": "application/json"
},
{
"rel": "child",
"href": "./s2/collection.json",
"type": "application/json"
},
{
"rel": "child",
"href": "./s2_scl/collection.json",
"type": "application/json"
},
{
"rel": "child",
"href": "./lcs/collection.json",
"type": "application/json"
},
{
"rel": "parent",
"href": "../catalog.json",
"type": "application/json"
},
{
"rel": "cite-as",
"href": "https://doi.org/10.48550/arXiv.2203.00379"
}
]
}
This additionally defines the spatial and temporal extent of the data. For reproducibility, the license, sci:doi, and sci:citation fields are important to mention. The “sci” prefix indicates that those fields come from the “Scientific Citation” extension, which is added to the “stac_extensions” list. They provide a way to relate the data to a scientific publication, in this case the publication where the AnthroProtect dataset was introduced and where users find more information about how the dataset was created, e.g., which preprocessing was applied. Note that the actual DOI URL is added to the “links” list.
The collection has three child collections: “s2_scl,” “s2,” and “lcs.” These directly correspond to ODC products in the presented example. ODC products are a central unit for structuring data inside of an ODC instance. An ODC product primarily defines which bands are available. For each product, ODC datasets are added which define the paths to specific bands which can be in separate files, spatio-temporal extent, and coordinate reference system. An ODC product can be interpreted as a data cube in a narrower sense while an Open Data Cube instance as a whole provides an index and a discovery and access mechanism to heterogeneous and distributed data.
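To make the relationship between ODC products and the data cube concrete, the following sketch loads a small spatio-temporal subset of one product through ODC’s Python API. The product and band names match the collections shown in this section, but the extent values, and the assumption that the client is configured against the Testbed ODC instance, are placeholders for illustration.
import datacube

dc = datacube.Datacube(app="testbed18-example")

# Load a small subset of the Sentinel-2 product indexed in the Open Data Cube instance.
data = dc.load(
    product="s2",                      # ODC product corresponding to the "s2" collection
    measurements=["B2", "B3", "B4"],   # blue, green, and red bands
    x=(20.27, 20.50),                  # longitude range (EPSG:4326)
    y=(66.46, 66.55),                  # latitude range (EPSG:4326)
    time=("2020-07-01", "2020-08-30"),
    output_crs="EPSG:4326",
    resolution=(-0.0001, 0.0001),
)
print(data)                            # an xarray.Dataset with dimensions (time, y, x)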
The next example is the STAC Collection for Sentinel-2 (“s2” collection).
{
"stac_version": "1.0.0",
"stac_extensions": [
"https://stac-extensions.github.io/datacube/v2.1.0/schema.json",
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
],
"type": "Collection",
"id": "anthroprotect_s2",
"title": "Sentinel-2: MultiSpectral Instrument",
"description": "Multi-dimensional Sentinel-2 data cube in a STAC collection.",
"license": "CC-BY-NC-SA-3.0",
"sci:citation": "Stomberg, T. T., Stone, T., Leonhardt, J., Weber, I., & Roscher, R. (2022). Exploring Wilderness Using Explainable Machine Learning in Satellite Imagery. arXiv preprint arXiv:2203.00379.",
"sci:doi": "10.48550/arXiv.2203.00379",
"extent": {
"spatial": {
"bbox": [
[
5.20467414651499,
55.353564035013825,
30.630830204026555,
70.36171326846416
]
]
},
"temporal": {
"interval": [
[
"2020-07-01T00:00:00.000Z",
"2020-08-30T23:59:59.999Z"
]
]
}
},
"cube:dimensions": {
"x": {
"type": "spatial",
"axis": "x",
"extent": [
5.20467414651499,
30.630830204026555
],
"reference_system": 4326,
"step": 0.0001
},
"y": {
"type": "spatial",
"axis": "y",
"extent": [
55.353564035013825,
70.36171326846416
],
"reference_system": 4326,
"step": 0.0001
},
"time": {
"type": "temporal",
"extent": [
"2020-07-01T00:00:00Z",
"2020-08-30T23:59:59Z"
],
"description": "Data includes only one time slice which represents the 25th percentile of cloud-filtered scenes in the time interval described by the temporal extent"
},
"spectral": {
"type": "bands",
"values": [
"B2",
"B3",
"B4",
"B5",
"B6",
"B7",
"B8",
"B8A",
"B11",
"B12"
]
}
},
"links": [
{
"rel": "root",
"href": "../../catalog.json",
"type": "application/json"
},
{
"rel": "parent",
"href": "../collection.json",
"type": "application/json"
},
{
"rel": "item",
"href": "./inv_hydroelectric-letsi_2019-07-01-2019-08-30/inv_hydroelectric-letsi_2019-07-01-2019-08-30.json",
"type": "application/geo+json"
},
{
"rel": "item",
"href": "./anthropo_24.81681-64.10019-5_0/anthropo_24.81681-64.10019-5_0.json",
"type": "application/geo+json"
},
{
"rel": "cite-as",
"href": "https://doi.org/10.48550/arXiv.2203.00379"
}
]
}
Like the AnthroProtect collection, this collection uses the “Scientific Citation” extension; in addition, it uses the “Datacube” extension, which allows the dimensions of the data cube (spatial, temporal, bands) to be specified in STAC. Moreover, there are links to STAC Items. An exemplary item for the Sentinel-2 collection is shown below.
{
"stac_version": "1.0.0",
"stac_extensions": [
"https://stac-extensions.github.io/file/v2.1.0/schema.json",
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
],
"type": "Feature",
"id": "inv_hydroelectric-letsi_2019-07-01-2019-08-30",
"collection": "anthroprotect_s2",
"properties": {
"datetime": "2020-08-01T12:00:00.000Z",
"projection": 32634,
"sci:citation": "Stomberg, T. T., Stone, T., Leonhardt, J., Weber, I., & Roscher, R. (2022). Exploring Wilderness Using Explainable Machine Learning in Satellite Imagery. arXiv preprint arXiv:2203.00379.",
"sci:doi": "10.48550/arXiv.2203.00379"
},
"bbox": [
20.26814539274694,
66.45825882626778,
20.496956405644095,
66.55002311657806
],
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
20.26814539274694,
66.45825882626778
],
[
20.26814539274694,
66.55002311657806
],
[
20.496956405644095,
66.55002311657806
],
[
20.496956405644095,
66.45825882626778
],
[
20.26814539274694,
66.45825882626778
]
]
]
},
"links": [
{
"rel": "root",
"href": "../../../catalog.json",
"type": "application/json"
},
{
"rel": "collection",
"href": "../collection.json",
"type": "application/json"
},
{
"rel": "parent",
"href": "../collection.json",
"type": "application/json"
},
{
"rel": "cite-as",
"href": "https://doi.org/10.48550/arXiv.2203.00379"
}
],
"assets": {
"default": {
"href": "https://18.testbed.dev.52north.org/geodatacube/stac/opendatacube/investigative/inv_airport-lucas-U314_45664214.tif",
"type": "image/tiff; application=geotiff",
"title": "inv_airport-lucas-U314_45664214",
"file:checksum": "a62448afe4db06bdae98db3e3b86e156c82582814c29f9ddeb21915ede490849",
"file:size": 18561018
}
}
}
In terms of reproducibility on the item level, the “File Info” extension is interesting as it allows for the specification of the file size and the file checksum. The latter helps to verify the integrity of the data.
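A consumer of such an item can use these fields to check that a downloaded asset is byte-identical to the one described by the metadata. The following sketch is illustrative only: it reads the item from a local file with the placeholder name item.json and, as in the example above, assumes the checksum is stored as a plain sha256 hex digest.
import hashlib
import json
import os
import urllib.request

with open("item.json") as f:           # the STAC Item shown above, saved locally (placeholder name)
    item = json.load(f)

asset = item["assets"]["default"]
local_file, _ = urllib.request.urlretrieve(asset["href"])

sha256 = hashlib.sha256()
with open(local_file, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

if sha256.hexdigest() != asset["file:checksum"]:
    raise RuntimeError("Asset content does not match the checksum recorded in the STAC Item.")
if "file:size" in asset and os.path.getsize(local_file) != asset["file:size"]:
    raise RuntimeError("Asset size does not match the file:size recorded in the STAC Item.")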
6.1.6. Deployment on Whole Tale
The Whole Tale initiative provides a multi-user platform where so-called tales can be created and used to reproduce scientific studies. A tale combines data, source code, the computational environment, and a narrative. By providing the computational environment, some of the limitations and difficulties of repeating a study on a local machine can be overcome. Because the source code of the original scientific study depends on the whole dataset being present in a specific folder structure, the code could not be used directly to create a tale, as it is impractical to upload the dataset manually. In principle, external data can be added to a tale (https://wholetale.readthedocs.io/en/stable/users_guide/manage.html#external-data); however, only a few data repositories are supported at the time of this report (DataONE, Dataverse, Globus, and Zenodo), whereas the AnthroProtect dataset is hosted on Sciebo.
Moreover, configuration files have to be added to enable Whole Tale to build a Docker image using repo2docker, from which the runtime environment is created. Thus a separate GitHub repository was created specifically for reproducing parts of the study on Whole Tale (https://github.com/52North/testbed18-wilderness-workflow). The first step is to provide configuration files for repo2docker. The original study provides a Python library (https://gitlab.jsc.fz-juelich.de/kiste/asos/-/blob/main/setup.py) that is used in Python scripts and Jupyter Notebooks to perform the actual analyses. The library is built on top of PyTorch and has further dependencies such as GDAL and rasterio. In order to install the GDAL Python library, the GDAL system library has to be installed first, which can be done by providing an apt.txt file. A limitation is that the Docker image is based on Ubuntu 18.04, which restricts the GDAL version. However, if this is a problem, a custom Dockerfile can be provided for better control over the environment. Python environments can be set up by using an environment.yml, a requirements.txt, or a setup.py file. Because of some dependency issues, a more customized approach was chosen using the postBuild file. It is also possible to define environment variables for the runtime environment by adding a line such as the following to the postBuild file.
echo 'export MY_ENV_VAR="MyValue"' >> ~/.profile
However, these environment variables are not available inside Jupyter Notebooks and are only available inside a terminal. It also has to be noted that the approach documented for repo2docker (https://repo2docker.readthedocs.io/en/latest/config_files.html#start-run-code-before-the-user-sessions-starts) is not supported by Whole Tale. Through personal communication with the Whole Tale development team, 52°North learned that the team is working on an alternative approach. Additionally, not all relevant logging statements are exposed to the user. Even though the Docker image built successfully, running the tale did not work initially. As the relevant logging statements which described the problem were not provided to the user, this was difficult to debug. However, with the help of Whole Tale’s development team the issue could be fixed in the end.
After setting up the computational environment successfully, the tale can be run. 52°North chose Jupyter Notebook as the environment, but others are also possible (JupyterLab, MATLAB, RStudio, or STATA). From the original workflow, only a simplified version of Step 4 for inspecting the activation space and Step 5.1 for analyzing samples are part of the tale as a Jupyter Notebook. As a preparational step, there is an additional Jupyter Notebook for downloading samples via a Coverage API to be analyzed in Step 5.1 (a sketch of such a request follows the figure below). Figure 2 shows how activation and sensitivity maps can be calculated for Sentinel-2 samples using the precalculated machine learning model.
Figure 2 — Screenshot of the Jupyter Notebook "51_analyze_samples"
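As a rough illustration of this preparational download step, the following sketch requests a single sample from an OGC API — Coverages endpoint over HTTP. The base URL, collection identifier, coverage path, and query parameters shown here are placeholders and assumptions; they may differ from the actual Testbed deployment.
import requests

BASE_URL = "https://example.org/geodatacube"   # placeholder for the coverage service base URL
COLLECTION = "anthroprotect_s2"                # placeholder collection identifier

response = requests.get(
    f"{BASE_URL}/collections/{COLLECTION}/coverage",
    params={
        "bbox": "20.27,66.46,20.50,66.55",     # lon/lat bounding box of the sample
        "f": "GeoTIFF",                        # requested output format (assumed parameter)
    },
    timeout=300,
)
response.raise_for_status()

with open("sample.tif", "wb") as f:
    f.write(response.content)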
One practical limitation was observed regarding the available resources in the runtime environment: if the samples are so large that ~8 GB of memory is exceeded, the kernel of the notebook will crash. As the tale was initially created from a GitHub repository, it would be beneficial to be able to synchronize the two. This is possible from within a running tale: from the Jupyter Notebook user interface a terminal can be opened, and as the .git repository is available in the tale workspace, changes from the original GitHub repository can be merged into the tale using git’s CLI.
6.1.7. Considerations and Limitations for Reproducible Workflows
Given the expected growth of demand for data in modern analysis and modeling applications, how well will data and processing environments be connected?
Will results need any level of pre-processing?
Will the Area of Interest need to be limited, and if so, how?
As processing loads grow with data and complexity of the analysis, what are acceptable processing times?
Is there a need for processing to be “outsourced” and made available through, for example, OGC API Processes?
6.1.8. Technical Interoperability Experiments
52°North shared their Tale (https://dashboard.wholetale.org/run/633d5fb4eb89f198ef8ce83f) and STAC with fellow Testbed 18 participants.
6.1.9. Lessons Learned
One important lesson learned is related to the code design of the original study. It uses a file-based logic to load the needed data inside the various analysis steps. This means that all files are required to be available and to be stored in a specific folder structure. Moving completely to an API-based approach is not trivial and was out of scope for the Testbed 18 activities. Thus, only some parts of the workflow could be run on Whole Tale and studied with respect to the role of Open Data Cube and OGC APIs in reproducibility. While this might be considered a specific challenge of the chosen workflow, it is probably encountered in many other applications as well, as there are practical advantages to working file-based: it is simple, often faster (depending on network speed and internet connection), self-contained, and reliable because an API could go offline. One way of supporting all analysis steps would be to provide them as processes via the OGC Processes API working directly on the files. This is described in the section “Future work.” With this background, the role of APIs to provide data for a workflow study could be restricted to collecting data only as a preparational step, i.e., data is downloaded via an API and stored in files which are then used in the analysis.
Another lesson learned is to be careful in the choice of the coordinate reference system when training a machine learning model with geo-referenced data and using it for model inference later. In the considered study, Sentinel-2 data is used in UTM coordinates with a spatial resolution of 10 m to train the machine learning model. For model inference inside of the Jupyter Notebook, data can be downloaded from a Coverage API which provides Sentinel-2 data in EPSG:4326 with 0.0001° resolution because the collection spans an area that crosses multiple UTM zones. Figure 3 shows a qualitative comparison of model predictions of the same Sentinel-2 sample but with a different coordinate reference system, one in UTM coordinates, the other in geographic coordinates. The latter one is stretched horizontally because the resolution is in degrees and is the same for latitude and longitude. While the general characteristics are very similar, some small deviations can be observed due to the resampling, e.g., in the south-western part of the images.
Figure 3 — Comparison of model predictions with different coordinate reference system and resolution (top: original UTM coordinates with 10 m resolution, bottom: EPSG:4326 coordinates with 0.0001° resolution.)
When working with Open Data Cube or other data cube software, STAC is a promising candidate to identify the provided data and ensure its use in a reproducible FAIR manner. STAC provides the potential to describe the data in a granular way, from top-level views on complete datasets like Sentinel-2 down to the level of a single file, where the File Info extension offers the option to provide — amongst others — the file hash, which can be checked to make sure the file has not changed. With the Scientific Citation extension, references to publications can be made so that users are able to understand where the data comes from and which preprocessing might have been applied, or to acknowledge data providers and users by citation. The Datacube extension can be used to describe the dimensions of a data cube; however, when using these extensions it has to be noted that they have different levels of maturity and might still change or be dropped. Specifically for Open Data Cube, there have been and will be more developments towards supporting STAC as a primary metadata language.
6.1.10. Future work
One important question that has not yet been answered is how the data and processing intensive workflow steps, i.e., the model training and the ASOS analysis, can be executed in a reproducible FAIR manner on the Whole Tale platform or similar platforms. Providing them via the OGC Processing API is an obvious approach. The processes could then be executed in a Jupyter Notebook on Whole Tale and the process outputs would be the model weights (checkpoint.pt) and the ASOS object (asos.pkl), respectively. In order to test and reproduce different model runs that use, for example, different hyperparameters or training data, a strategy is needed to version these process outputs. One possibility is to use Data Version Control (https://dvc.org/), which was used by Arizona State University. The question would be how this can be integrated with Whole Tale. From a machine learning perspective, the influence of the choice of the coordinate reference system and the spatial resolution could also be further investigated.
Another aspect that could be further investigated is the reproducibility of data APIs, especially if they are not open source, like the Google Earth Engine. In the considered workflow the AnthroProtect dataset was prepared using the Google Earth Engine and could be used for reproducibility experiments in the future. What metadata is needed to identify the data, and how can it be tested that the same request always returns the same data? Is it desirable that the data does not change, or are there cases where this might not make sense?
Relating to the experiments with Open Data Cube and STAC, the existing tools to use STAC for metadata description, data discovery, and data loading could be evaluated. Of special interest is how the aforementioned STAC extensions (Datacube, Scientific Citation, File Info) can improve the reproducibility when using the tools.
6.2. Arizona State University
6.2.1. Goals of Participation
Arizona State University’s goal for Testbed 18 was the development of a general workflow to demonstrate the FAIR and reproducibility principles for a scientific deep learning application, containing the following components.
Input: Make the data easily accessible and reusable. Be compliant with OGC Standards, including the new OGC API Standards.
Application: Develop a Jupyter notebook that includes executable code and detailed comments for each step (input, model, training, evaluation, and inference) of the deep learning workflow.
Output: Enable multiple outputs including standard object detection results and images with rendered prediction results which are compliant with OGC Standards.
The example use case developed for Testbed 18 was Target Detection: Identifying and drawing the bounding box of desired targets, as illustrated below.
Figure 4 — Arizona State University Target Detection
This reproducible workflow for deep learning applications, especially the details on limiting nondeterministic behaviors, provides the following.
For End-users: the ability to share, deploy, and reuse data; a video demonstration.
For OGC Standards Baseline: a use case for reproducible deep learning applications.
For Developers: a potential deep learning application reproducible workflow.
6.2.2. Contributed Workflow and Architecture with Challenges and Solutions
6.2.2.1. Workflow Tasks
Data preparation and web service hosting
Docker environment preparation
Application/Workflow development in Jupyter notebook
Reproducibility verification
6.2.3. General Workflow for a Deep Learning Application
Figure 5 — Arizona State University General Workflow
The above figure shows a general workflow for a deep learning application. It contains five components: data, environment, model, training, and inference. In each component, several factors would affect the reproducibility of a deep learning application. There are also dependencies between different components (input, feedback, etc.). When implementing a reproducible workflow, consideration of both internal and external factors and interactions is important. The following sections demonstrate how to design and implement a reproducible workflow for each component.
6.2.4. Object Detection Workflow
Figure 6 — Arizona State University Object Detection Workflow
To demonstrate the proposed workflow, object detection was used as an example. For each input image, the model identifies and draws bounding boxes on desired targets. The above figure shows the sample object detection workflow. There are five components.
Data: the DOTA dataset was used as the core data for the workflow. This dataset is used for object detection in aerial images; it contains 18 categories, 11,268 images, and 1,793,658 instances.
Environment: PyTorch was used to build the deep learning workflow. PyTorch is an open source machine learning framework for fast research prototyping.
Model: Mask R-CNN was used as the sample model. This is an object detection/segmentation model. For each input image, the model will identify and draw bounding boxes/masks on the targets.
Training: the training calculates the loss between the predictions and the ground truths and uses backpropagation to update the network weights in order to minimize corresponding errors.
Inference: The inference generates the bounding boxes for the targets based on the model’s predictions and confidence scores.
6.2.5. Reproducible Data Workflow
The Jupyter notebook used to demonstrate the challenges and possible solutions of a reproducible data workflow is available at https://nbviewer.org/gist/chiayuhsu/121e5c9cc68222bdc8cbd4bd75c80fee.
6.2.6. Model Interoperability
For model interoperability, Open Neural Network eXchange (ONNX) is one of the initiatives that enables model sharing between different frameworks or ecosystems. ONNX is an open format for representing machine learning models. In ONNX, a model is represented as an acyclic dataflow graph with a list of nodes. Each node is a call to an operator with one or more inputs and one or more outputs. The graph also has metadata for documenting the authors, tasks, etc. A trained model with ONNX format can be used within different frameworks, tools, runtimes, and compilers.
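As a minimal illustration of this exchange path, the following sketch exports a small stand-in PyTorch model to ONNX and runs it in ONNX Runtime. It intentionally uses a tiny network rather than the Mask R-CNN applied in the Testbed workflow, which requires additional export handling.
import torch
import torch.nn as nn
import onnxruntime as ort

# A tiny stand-in model used only to demonstrate the export/run round trip.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,
)

# The exported graph can then be executed by a different runtime, independent of PyTorch.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
result = session.run(None, {"input": dummy_input.numpy()})
print(result[0].shape)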
6.2.6.1. Reproducibility and Replicability Discussions
Reproducibility: The ability to repeat an experiment with minor differences while still achieving the same qualitative result [kastner_c]. That is, the result is stable and the conclusion and findings are consistent even with slightly different settings.
Replicability: The ability to generate the same results with the same evaluation criteria under the same data, environment, model, training, and inference processes.
To achieve both reproducibility and replicability, it is important that sufficient details and specific steps are described. According to Sugimura and Hartl [sugimura_peter_and_florian_hartl], “The root cause of most, if not all, reproducibility problems is missing information.” The following sections discuss challenges and possible solutions for reproducibility and replicability in the different components of a deep learning application.
6.2.6.2. Data
Challenges
Change in data: data changes over time, including adding data, removing data, and updating data.
Incorrect data transformation: for example, cleaning, augmentation, and changes in data distribution.
External data: some data processing steps may depend on external data sources that do not produce stable results.
Solutions
Data versioning and change tracking means keeping track of and recording every data change. Depending on the size and structure of the data, how often the data is updated, and how often old versions are accessed, there are several different data versioning strategies.
Storing copies of entire datasets. Each time a dataset is modified, a new copy of the entire dataset is stored. Pros: this approach is easy to implement in a storage system and to provide access to different versions of content. Cons: large storage space demand.
Storing deltas between datasets. Pros: space efficient. Cons: restoring a specific version of a dataset may require applying many changes to the original dataset.
Storing offsets in append-only datasets. For append-only datasets, such as log files, applications can simply use the file size to reconstruct a file at any given time. Pros: simplest dataset versioning. Cons: only applicable to specific dataset types.
Versioning corresponding transformations. Storing the original dataset and all steps involved to produce a specific version of dataset. Pros: may be more efficient to recreate the dataset on demand rather than to store it. Cons: the steps need to be deterministic.
Versioning available metadata, such as schema information, license, provenance, owner, etc.
Versioning the processing or editing steps applied to a specific version of a dataset. This applies to all processing steps including data collection/acquisition, data merging, data cleaning, or feature extraction. When using models with machine learning techniques, often input datasets are used that are quite removed from their original source. For example, data collected in the field, supplemented with data from other sources, cleaned in various manual and automated steps, and prepared with some feature extraction. At the stage where some data scientists may experiment with different network architectures, all that history and all the decisions that may have gone into the dataset, and all possible biases, may be lost. Versioning of the process can be recorded through metadata notes or drawn architectural data flow diagrams.
Using data with documented timestamps and versions. It is useful to record the exact training data after data processing at training time and to log all inputs to the model (including those gathered from external sources) at inference time. A minimal sketch of recording a dataset version follows this list.
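As a concrete illustration of dataset versioning through recorded checksums and timestamps, the following is a minimal sketch. The directory layout, version label, and manifest file name are hypothetical and not taken from any Testbed 18 workflow.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Compute the SHA-256 checksum of a file in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_dir = Path("data/clean")          # hypothetical dataset directory
manifest = {
    "version": "v17",                  # assigned by the team's own versioning convention
    "created": datetime.now(timezone.utc).isoformat(),
    "files": {
        str(path.relative_to(data_dir)): file_sha256(path)
        for path in sorted(data_dir.rglob("*")) if path.is_file()
    },
}

# The manifest itself can be committed to version control alongside the pipeline code.
Path("data_manifest_v17.json").write_text(json.dumps(manifest, indent=2))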
6.2.6.3. Environment
Challenges
Hardware
Changes in GPU architectures make reproducibility difficult unless deterministic behavior is enforced for the relevant operations.
Floating-point discrepancies due to hardware settings, software settings, or compilers.
Non-deterministic behaviors: parallel floating-point calculations on the GPU or auto-tuning features in libraries such as CUDA and cuDNN.
Software / libraries
ML frameworks are constantly being upgraded, and these updates can change results. For instance, PyTorch 1.7+ supports mixed-precision training natively, a capability that earlier versions only offered through NVIDIA’s apex library. Also, changing from one framework to another (e.g., TensorFlow to PyTorch) will generate different results.
Solutions
Hardware: versioning the hardware, drivers, and all other parts of the environment. Besides versioning, the environment should fulfill the following criteria.
Having the ability to return to the previous state without destroying the setup.
Utilizing identical versions on multiple machines.
Setting randomization parameters.
Software: versioning frameworks or libraries. It is important to version both pipeline code (see the Training section) and the frameworks or libraries involved to ensure reproducible executions. Common strategies for versioning library dependencies are as follows.
Using a package manager and declaring versioned dependencies on libraries from a stable repository, e.g., requirements.txt with pip or package.json with npm. Specifying exact releases of dependencies (e.g., 2.3.1) instead of floating versions (e.g., 2.*) is important.
Committing all dependencies to the code repository.
Packaging all dependencies into virtual execution environments. This is another way to version frameworks and libraries, and it also tracks environment changes such as drivers. Docker containers are a common solution. A minimal sketch of recording the resolved environment at run time follows this list.
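The following is a minimal sketch of recording the software environment actually resolved at run time, assuming a PyTorch-based setup. The package names and output file name are illustrative, not prescribed by this report.

import json
import platform
import sys
from importlib import metadata

# Record interpreter, OS, and resolved package versions.
record = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {},
}
for name in ("numpy", "torch"):        # extend with the project's own dependencies
    try:
        record["packages"][name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        record["packages"][name] = "not installed"

try:
    import torch                        # assumed deep learning framework for this sketch
    record["cuda"] = torch.version.cuda
    record["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        record["gpu"] = torch.cuda.get_device_name(0)
except ImportError:
    pass

with open("environment.json", "w") as handle:
    json.dump(record, handle, indent=2, default=str)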
6.2.6.4. Model
Challenges
The parameters of neural networks are initialized with random values, so training repeatedly on the same data will not yield the same model.
Randomness in some operations, e.g., dropout, random augmentations, mini-batch sampling, random noise introductions, etc.
Solutions
Nondeterminism from random numbers can be controlled by explicitly seeding the random number generators used (see the seeding sketch after this list).
Versioning models: models are usually saved as binary data files. In deep learning applications, small changes to data or hyperparameters can lead to changes in many model parameters. Therefore, versioning the deltas between different model versions may be meaningful. Any system that tracks versions of binary objects could be useful.
Versioning model provenance: this refers to tracking all inputs to a model, including data, hyperparameters, and pipeline code with its dependencies.
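The seeding solution above can be made concrete with the following minimal sketch, assuming a PyTorch-based application; equivalent calls exist in other frameworks, and the seed value is arbitrary.

import os
import random

import numpy as np
import torch

SEED = 42

random.seed(SEED)                      # Python's built-in RNG
np.random.seed(SEED)                   # NumPy RNG (augmentations, shuffling)
torch.manual_seed(SEED)                # PyTorch CPU RNG
torch.cuda.manual_seed_all(SEED)       # PyTorch CUDA RNGs

# Trade speed for determinism in cuDNN / CUDA kernels.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # required by some deterministic CUDA ops
torch.use_deterministic_algorithms(True)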
6.2.6.5. Training
Challenges
In distributed training, timing differences in distributed systems can affect the learned parameters due to the involved merging strategies.
The lack of proper logging of parameters such as hyperparameter values, batch sizes, and learning rates during model training makes the model difficult to understand and replicate.
Machine learning is an iterative process with lots of experimentation: changing the values of the parameters; checking the performance of different algorithms; and fine-tuning to get good results. As the experimental process gets longer and more complicated, recording all of the important details becomes more difficult.
Randomness in the software: batch ordering, data shuffling, and weight initialization.
Non-deterministic algorithms such as stochastic gradient descent, Monte Carlo methods, mini-batch sampling, etc.
Solutions
Nondeterminism from random numbers can be controlled by explicitly seeding the random number generator used.
Versioning feature provenance: this refers to tracing how features in the training and inference data were extracted, that is, mapping data columns to the code version that was used to create them. It is also helpful to keep the generation code for individual features independent of one another.
Versioning code: track and record changes in code and algorithms during experimentation.
Versioning pipelines: a training pipeline can contain many stages for extracting features, calculating losses, updating parameters, and optimizing hyperparameters. Versioning all steps is necessary for tracking which versions of the individual parts went into creating a specific model. Pipeline code and hyperparameters can be expressed in normal code and configuration files and can be versioned like traditional code (a minimal run-record sketch follows this list).
Versioning experiments: data scientists routinely experiment with different versions of extracting features, different modeling techniques, and different hyperparameters. Tracking information about specific experiments and their results is useful. The approach focuses on keeping and comparing results, often visualized in some dashboard, e.g., MLflow.
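As referenced above, the following minimal sketch records a training run’s code version and hyperparameters alongside its outputs. It assumes the pipeline code is tracked in Git; the hyperparameter values, version labels, and file names are illustrative.

import json
import subprocess
from datetime import datetime, timezone

# Illustrative hyperparameters for this particular run.
hyperparameters = {"learning_rate": 1e-3, "batch_size": 32, "epochs": 50}

run_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Pipeline code version: the Git commit the run was launched from.
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip(),
    "data_version": "v17",             # e.g., from the dataset manifest or DVC
    "hyperparameters": hyperparameters,
}

with open("run_record.json", "w") as handle:
    json.dump(run_record, handle, indent=2)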
6.2.6.6. Inference
Challenges
Model versioning: are the models still available? Which model performs the best?
Model provenance: what training data was used? What is the feature extraction code? What version of the ML library/framework? What hyperparameters were used?
Model inference: will the same result be achieved using different inference approaches?
Solutions
Versioning model provenance: this metadata connects all of the processing elements. For example, model version v12 was built from data version v17, pipeline version v4.1, etc. Such metadata can also ensure that the same code (data cleaning, feature extraction, etc.) is used at inference time as was used at training time.
Versioning deployed models: knowing which model version handled specific inputs during inference, as part of log files or audit traces, is useful. The application or user can then track the model version responsible for every output over time and in the presence of A/B tests. A minimal audit-logging sketch follows.
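The following minimal sketch illustrates such audit logging at inference time. The version identifiers mirror the example above, and the function, log file, and prediction value are hypothetical.

import logging
import uuid

logging.basicConfig(filename="inference_audit.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

MODEL_VERSION = "v12"        # illustrative identifiers matching the provenance example above
DATA_VERSION = "v17"
PIPELINE_VERSION = "v4.1"

def log_prediction(request_id: str, prediction) -> None:
    """Record which model, data, and pipeline versions produced each output."""
    logging.info("request=%s model=%s data=%s pipeline=%s prediction=%s",
                 request_id, MODEL_VERSION, DATA_VERSION, PIPELINE_VERSION, prediction)

log_prediction(str(uuid.uuid4()), "water")   # example call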
6.2.6.7. Tools
Many tools have been developed to help track versions of data, pipelines, and models. These tools also often help record further metadata, such as training time or evaluation results (e.g., prediction accuracy or robustness). An example follows.
In DVC, all processing steps are modeled as separate stages of a pipeline. Each stage is implemented by an executable that is called with some input data and produces some output data. The stages and data dependencies are described as pipelines in some internal format, such as the following example from the DVC documentation.
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - data/clean
    params:
      - levels.no
    outs:
      - features
    metrics:
      - performance.json
  training:
    desc: Train model with Python
    cmd:
      - pip install -r requirements.txt
      - python train.py --out ${model_file}
    deps:
      - requirements.txt
      - train.py
      - features
    outs:
      - ${model_file}:
          desc: My model description
    plots:
      - logs.csv:
          x: epoch
          x_label: Epoch
    meta: 'For deployment'
    # User metadata and comments are supported
Both the implementations of the stages and the pipeline description are versioned in a Git repository. DVC then provides command line tools to execute individual pipeline steps or the entire pipeline at once. DVC runs the executable of each step according to its dependencies and tracks the versions of all inputs and outputs as metadata stored in the Git repository.
MLflow is a popular framework for tracking experiment runs. It makes it easy to log and track experiments and to show results in dashboards. MLflow stores the results of a run together with the corresponding hyperparameters, the source version of the pipeline, and the training time. The dashboard is useful for comparing the results of training with different hyperparameters, but it can also be used to compare different versions of the pipeline or training with different datasets. A minimal tracking sketch follows.
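As an illustration, the following minimal sketch logs a run with MLflow, assuming the mlflow package is installed. The experiment name, parameters, tags, and artifact path are placeholders rather than part of any Testbed 18 workflow.

import mlflow

mlflow.set_experiment("crop-classification")   # hypothetical experiment name

with mlflow.start_run():
    # Hyperparameters and provenance tags for this run (illustrative values).
    mlflow.log_params({"learning_rate": 1e-3, "batch_size": 32, "n_estimators": 200})
    mlflow.set_tags({"pipeline_version": "v4.1", "data_version": "v17"})

    # ... train and evaluate the model here ...

    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_artifact("model.onnx")          # assumes the trained model file exists

Runs recorded this way can then be compared side by side in the MLflow dashboard.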
Researchers often use Jupyter notebooks for exploratory work, which is not commonly versioned with Git because the code might not be organized in cohesive incremental pieces. In addition, the notebook file format, which stores output data and images together with the cells’ code in JSON, does not lend itself easily to textual diffs or to many traditional version control and code review processes. Recently, history recording tools designed for notebooks have appeared, such as Verdant. Verdant is a JupyterLab extension that automatically records the history of all experiments run in a Jupyter notebook and stores it in a tidy .ipyhistory JSON file, designed to work alongside and complement any other version control system in use, such as SVN or Git.
6.2.7. Additional Considerations and Limitations for Reproducible Deep Learning Workflows
What is reproducible training?
Under the same training code, same environment, and the same training dataset, the resulting trained model yields the same results under the same evaluation criteria.
What are the challenges of reproducible training?
Randomness in the software: batch ordering, data shuffling, and weight initialization.
Non-determinism in the hardware: parallel floating-point calculations on the GPU and auto-tuning features in libraries such as CUDA and cuDNN.
Lack of systematic guidelines: what system information is needed to support reproducibility, e.g., resources (dataset, environment), software (source code), metadata (dependencies), and execution data (execution results, logs), and how to manage them (Git, DVC, MLflow).
6.2.8. Lessons Learned
The reproducibility of deep learning applications and experiments has strong impacts and implications for science. Reproducibility brings credibility and trust into the system, which are the foundations of scientific research and advancement. However, based on work in Testbed 18, difficulties in generating the same experimental results in deep learning applications were identified, stemming from both intrinsic factors (models, optimization methods, etc.) and extrinsic factors (hardware, software, etc.). To address the reproducibility crisis, more focus and analysis on reproducibility are required in deep learning research. For example, what details about an application need to be provided to guarantee reproducibility? How can the resulting output differences be analyzed, estimated, and controlled? As deep learning has become a part of everyday life, it is important to develop a standard and systematic way to evaluate its reproducibility.
6.3. Ecere
6.3.1. Goals of Participation
The goal of Ecere’s participation in Testbed 18 was to produce one or more reproducible FAIR workflows following the approach to defining workflows described in OGC API — Processes — Part 3: Workflows and Chaining.
Task objectives were as follows.
The reproducible workflow implementation will help establish best practices for reproducible science.
The task will provide use cases for the Coverages and Processes SWGs as well as the EOXP and Workflows DWGs.
The task use cases will provide developers with a potential approach to implement, define, and share reproducible workflows.
6.3.2. Contributed Workflow and Architecture
The Reproducible Workflows implementation allows users to define, execute, and retrieve outputs from workflows. Ecere provided workflow execution endpoints that produce deterministic results. The workflow accepts inputs from external OGC APIs, with the ability to retrieve results and trigger processing for a region, time, and/or resolution of interest. Several encodings are supported, such as GeoTIFF, GeoJSON, and GNOSIS Map Tiles, among others.
The workflow scenarios explored for this task were crop classification using machine learning as well as coastal erosion in the Arctic. As a result of significant challenges encountered for each of these scenarios, the ability to demonstrate these workflows at the end of Testbed 18 was still limited, and efforts were ongoing to complete the development of the planned capabilities.
Additional processing capabilities supported as part of the ability to submit ad hoc workflows to the instance deployed by Ecere include:
performing basic analytics, e.g., computing derived fields such as index calculations from multi-band imagery, and filtering data (for instance, to merge multiple scenes or to eliminate values outside of a certain threshold);
gridding point clouds;
tracing elevation contours;
routing calculations; and
server-side map rendering.
Ecere’s instance uses the GNOSIS Map Server which provides support for several OGC API Standards (draft and approved), including the following.
OGC API — Processes (including Part 3: Workflows and chaining)
OGC API — Features (including Part 2: CRS by reference and Part 3: Filtering)
OGC API — Coverages
OGC API — Tiles
OGC API — Discrete Global Grid System
OGC API — 3D GeoVolumes
OGC API — Maps
In conjunction with this task, significant progress was made on improving the OGC API — Processes — Part 3 draft candidate Standard itself, writing the requirements and organizing them into requirements classes with enough clarity so as to facilitate the development of additional conforming implementations.
6.3.2.1. OGC API Processes — Part 3: Workflows and Chaining Diagram
Figure 7 — Coastal Erosion Workflow Diagram
6.3.3. Coastal Erosion Workflow
For Testbed 18, in collaboration with the University of Calgary, Ecere expressed the Coastal Erosion model developed by Perry Peterson in the context of the Federated Marine SDI (FMSDI) Pilot Phase 3 scenario as an OGC API — Processes execution request leveraging the Part 3: Workflows and Chaining extension. Ecere also mapped this workflow to a flow diagram, providing a clear visual illustration of how the different data sources are integrated. For the FMSDI pilot, Ecere provided a DGGS client accessing and visualizing the output from the coastal erosion susceptibility workflow executed on the University of Calgary’s DGGS Data Integration Server, triggering on-demand processing. This DGGS server at the University of Calgary quantizes and samples the data to an Icosahedral Equal Area aperture 3 Hexagonal (ISEA3H) DGGS, parameterizes the data for the erosion model, and finally serves the resulting collection through a DGGS API.
The workflow integrated vector features and coverage data from four different sources:
ArcticDEM (https://www.pgc.umn.edu/data/arcticdem/ – NGA)
Global Land Cover (https://www.usgs.gov/centers/eros/science/usgs-eros-archive-land-cover-products-global-land-cover-characterization-glcc – USGS)
Geologic map of Alaska (Surficial Geology) (https://www.usgs.gov/centers/alaska-science-center/science/geologic-map-alaska – USGS)
Circum-Arctic permafrost and ground ice (https://nsidc.org/data/ggd318/versions/2 – NSIDC)
The client accessed the data values using a dual DGGS of ISEA3H, named ISEA9R in the FMSDI project, which uses rhombic zones and an aperture of 9. Other DGGSs and reference systems, such as the ISEA3H PIXYS indexing or the GNOSIS Global Grid, or other access mechanisms such as OGC API — Tiles, Coverages, or EDR, could also be used by clients to trigger the same workflow if the server provided them.
In addition to supporting the use of one process’s output as an input to another, the Processes — Part 3 extension allows an OGC API collection to be an input to a process, as well as the output of a process or of the overall workflow.
Figure 8 — Visualizing the output of the coastal erosion workflow in Ecere’s GNOSIS Cartographer client (large scale view)
Figure 9 — Visualizing the output of the coastal erosion workflow in Ecere’s GNOSIS Cartographer client (closer view)
Figure 10 — Visualizing coastal erosion workflow definition as a flow diagram
The visualization also integrated additional datasets:
Blue Marble Next Generation (https://earthobservatory.nasa.gov/features/BlueMarble – NASA)
Black Marble (https://blackmarble.gsfc.nasa.gov/ – NASA)
GEBCO (https://www.gebco.net/news_and_media/gebco_2014_grid.html)
Viewfinder Panoramas (https://viewfinderpanoramas.org – Jonathan de Ferranti, based on SRTM / USGS and other sources)
Gaia Sky in Colour (https://sci.esa.int/web/gaia/-/60196-gaia-s-sky-in-colour-equirectangular-projection – ESA)
Ecere initiated the development of the necessary components to execute this workflow in its GNOSIS Map Server, such as the PassThrough process and the ability to specify field modifiers in execution requests. However, at the end of the initiative some additional work still needed to be completed, in particular regarding the ability to perform spatial join operations on both coverages and vector features. Efforts to complete this functionality and to allow execution of the sample workflow shown below were ongoing at the time this report was published.
Once Ecere’s implementation of the coastal erosion workflow is completed, it will provide an opportunity to compare the output side-by-side with the output produced by the University of Calgary’s DGGS server, validating the reproducibility of the workflow on two different implementations, each using a different Discrete Global Grid System (ISEA3H and the GNOSIS Global Grid) to perform the data integration and calculations.
6.3.3.1. JSON Execution Request with CQL2
In this example scenario, expressions written in CQL2, the OGC Common Query Language, map properties from the datasets to numerical estimates of how much they contribute to erosion, and specify final weights in order to output a single susceptibility percentage value.
{
  "process": "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
  "inputs": {
    "data": [
      {
        "process": "https://maps.gnosis.earth/ogcapi/processes/Slope",
        "inputs": {
          "dem": { "collection": "https://maps.gnosis.earth/ogcapi/collections/SRTM_ViewFinderPanorama" }
        },
        "properties": { "s" : "slope >= 36.4 ? 10 : slope >= 17.6 ? 7 : slope >= 8.7 ? 5 : slope >= 3.5 ? 3 : 1" }
      },
      {
        "process": "https://maps.gnosis.earth/ogcapi/processes/Aspect",
        "inputs": {
          "dem": { "collection": "https://maps.gnosis.earth/ogcapi/collections/SRTM_ViewFinderPanorama" }
        },
        "properties": { "a" : "aspect >= 315 or aspect < 45 ? 1 : aspect >= 225 or aspect < 135 ? 5 : 10" }
      },
      {
        "collection": "https://maps.gnosis.earth/ogcapi/collections/ArcticPermafrost",
        "properties": {
          "e" : "extent = 'c' ? 1 : extent = 'd' ? 5 : extent = 's' ? 7 : extent = 'i' ? 10 : null",
          "c" : "content = 'l' ? 1 : content = 'm' ? 5 : content = 'h' ? 10 : 0"
        }
      },
      {
        "collection": "https://maps.gnosis.earth/ogcapi/collections/Landsat7LandCover",
        "properties": { "l" : "lc in(0,1,2,3,4,5,11,13,15) ? 1 : lc in(6,7,8,9,10,12,14) ? 5 : lc = 16 ? 10 : 0" }
      },
      {
        "collection": "https://maps.gnosis.earth/ogcapi/collections/AlaskaSurficialGeology",
        "properties": {
          "g" :
            "qcode = 'Ql' ? 0 : qcode in ('Qra','Qi','Qrc','Qrd','Qre') ? 1 : qcode in ('Qrb','Qaf','Qat','Qcb','Qfp','Qgmr') ? 3 : qcode in
             ('Qcc','Qcd','Qel','Qm1','Qm2','Qm3','Qm4','Qw1','Qw2') ? 5 : qcode in ('Qes','Qgm') ? 7 : qcode in ('Qed','Qgl','Qu') ? 10 : 0"
        }
      }
    ]
  },
  "properties": { "susceptibility" : "0.30 * s + 0.05 * a + 0.05 * e + 0.20 * c + 0.10 * l + 0.30 * g" }
}
6.3.4. Crops Classification Workflow
Ecere originally intended to demonstrate reproducibility in the context of a crop classification workflow previously developed for the 2020-2021 Modular OGC API Workflows (MOAW) project led by Ecere in collaboration with several other OGC members and other organizations, for which financial support was provided by Natural Resources Canada’s GeoConnections program.
A major difficulty faced with this workflow was its reliance on Sentinel-2 Level 2A input data provided from an OGC API — Coverages endpoint. A persisting issue with the EuroDataCube implementation that previously provided this capability prevented its use. As an alternative, Ecere spent effort improving the data cube capabilities of the GNOSIS Map Server to serve that purpose, with the goal of providing efficient access to Sentinel-2 data sourced from Cloud Optimized GeoTIFFs and cataloged with STAC metadata. The development of those capabilities presented its own set of challenges, described below, and remained to be completed at the end of Testbed 18. The ability to demonstrate the crop classification workflow was therefore still pending the completion of those data cube capabilities.
The output from previous executions of this workflow, as well as the training data, are shown below. The workflow uses a Random Forest machine learning prediction algorithm to classify crops based on Earth Observation imagery from ESA Sentinel-2 across multiple seasons.
Figure 11 — Crop Classification Workflow output visualized in Ecere’s GNOSIS Cartographer client (from 2020-2021 MOAW GeoConnections project)
Figure 12 — Crop Classification Workflow output visualized in QGIS client (from 2020-2021 MOAW GeoConnections project)
Figure 13 — Parcels training data for crop classification workflow (from 2020-2021 MOAW GeoConnections project)