I. Executive Summary
The Open Science Persistent Demonstrator (OSPD) is a collaborative initiative led by the Open Geospatial Consortium (OGC), in partnership with the European Space Agency (ESA) and the National Aeronautics and Space Administration (NASA). Its primary goal is to advance open science by enabling reproducible Earth Science research across global communities.
Key Objectives:
Promote Open Science: Facilitate broader access to Earth Observation (EO) data and tools, encouraging cross-disciplinary research and informed decision-making.
Enhance Interoperability: Test and demonstrate the integration of diverse Earth Observation and Earth Science cloud technologies and infrastructures developed by ESA, NASA, and other international organizations.
Develop a Persistent Demonstrator: Create a 24/7 web application that showcases scientific workflows across multiple platforms, demonstrating how different organizations’ platforms can be utilized for collaborative research and data representation.
The Open Science Persistent Demonstrator (OSPD) is a multi-year project, with the first phase completed in late 2024. OSPD brought together four essential elements.
OGC Standards that enable interoperability between platforms, platforms and applications, and applications with data sets
A set of international Earth observation platforms, including
CRIM provides federated platforms for climate data analysis with tools for geospatial data storage, visualization, and modeling using advanced interoperability standards.
GeoLabs offers an open-source geospatial processing engine, ZOO-Project, supporting OGC standards for deploying workflows and integrating with high-performance computing resources.
Terradue delivers on-demand processing services for satellite data applications, such as flood mapping, using OGC-compliant APIs and workflows.
Ellipsis Drive, a scalable geospatial data storage and sharing platform that supports multiple OGC protocols for seamless dataset access and management.
PolarTEP, a European Space Agency platform enabling remote data access, interactive development, and machine learning tools for polar science research.
Terrabyte (DLR), a high-performance data analytics platform hosted by the German Aerospace Center, offering extensive computational resources for Earth observation workflows.
Development Seed implemented a NASA VEDA (Visualization, Exploration, and Data Analysis) instance for EO data ingestion, visualization, and STAC catalog integration to support interoperable workflows.
Open Science Studio facilitates the creation and sharing of APIs from Jupyter Notebooks, promoting reproducibility and accessibility in research workflows.
openEO / Google Earth Engine provides a standardized interface to connect various EO data sources with the computational capabilities of Google Earth Engine for reproducible workflows.
iGUIDE contributed the CyberGIS-Compute framework, which provides transparent access to HPC resources and enables users to manage containerized computational models and tools that can be configured and launched on HPC through an interactive interface.
The Galaxy Project, a free open-source web application that enables users to create, reuse, and publish scientific workflows for the analysis, integration, and visualization of data. Galaxy was used in the project for scientific workflow orchestration and execution.
The Open Science Framework (OSF) is a free online research platform designed to help researchers openly and transparently share their work at all stages of the research project lifecycle, facilitating collaboration, documentation, archiving, and the sharing of research projects, materials, and data. OSF was used in the project as an online registry for all OSPD platforms, workflows, and applications.
All material, background information, an introduction to collaborative open science in general, and lessons learned are available from the project website.
II. Keywords
The following are keywords to be used by search engines and document catalogues.
open science, workflows, Earth observation, reusability, portability, transparency
III. Overview
Addressing today’s complex challenges requires Collaborative Open Science that enables integrity, provenance, and trust and fosters cross-domain integrations. However, building workflows and data flows that can operate across sectors remains technically challenging and resource-intensive, hindering whole-system change. Organizations in various sectors aspire to demonstrate accountability but often lack effective tools. The Open Geospatial Consortium (OGC) Open Science Persistent Demonstrator (OSPD) Pilot aims to promote collaborative open science. It facilitates responsible innovation linked to Earth Observation (EO) by simplifying the connection of data and platforms in transparent, portable, reproducible, standards-conformant workflows.
The OSPD promotes open science by embodying principles of reusability, portability, and transparency. Reusability involves consistent utilization of EO data and workflows, together with sector-specific data, across platforms to maximize efficiency. Portability ensures seamless transition of EO data and applications across platforms for broader applicability. Transparency provides clear insights into EO data processing, building trust within the community. Standards enable interoperability and quality assurance. Key work in the development of the OSPD includes implementing OGC standards for diverse platforms to make geospatial data and services available on a workflow platform, utilizing templates to document services and workflows, and providing linked learning materials for user accessibility. The OSPD is built around four key components: a community of geospatial computing platforms that utilize OGC Standards-conformant data and services; Galaxy, which enables users to build and test cross-platform workflows; the Open Science Framework (OSF), which serves as an archive and discoverability hub; and OGC Standards such as OGC API-Processes to enable cross-platform interoperability.
IV. Future Outlook
Important planned features of the OSPD support essential open science actions such as assigning persistent identifiers, writing documentation, implementing version control, standardizing data formats and APIs, transferring licenses and metadata, and ensuring cross-platform portability and testing. Use cases developed for the OSPD demonstrate how it supports users in composing and documenting open science workflows across multiple platforms and illustrate the value of open science, not only for research, but for all organizations that need to provide transparent data-driven decision making or serve in positions of public trust.
V. Value Proposition
The OSPD generates value for organizations that provide geospatial data and services by improving their discoverability and usability. Organizations will have greater impacts with their data and applications, because OSPD enhances the visibility of all elements by connecting data with descriptions, platforms with services, and users with applications. It provides value to science and research users by facilitating discovery of open standards-based geospatial tools and services, supporting them to maintain data integrity and provenance when reusing them in novel workflows and applications, and accelerating research by enabling reuse of well-documented standards-based tools.
1. The OSPD Initiative
The OGC Open Science Persistent Demonstrator (OSPD) initiative will be carried out over several years as part of the OGC Collaborative Solutions and Innovation Program. This report discusses the results of the first phase. To ensure the most sustainable development possible, some parts, for example Clause 2.2, are defined across phases. In the first phase, they primarily serve to provide orientation and a framework and will be fully implemented in the further course of the initiative.
2. Introduction
2.1. Aims
Collaborative open science is essential to addressing complex challenges whose solutions prioritize integrity, provenance, and trust, and require cross-domain integrations. Today, building workflows, processes, and data flows across domains and sectors remains technically difficult and practically resource intensive, creating barriers to whole-systems change. While organizations in the public and private sector, as well as in research, increasingly aim to demonstrate accountability, they often lack the tools to act effectively. The Open Geospatial Consortium (OGC) Open Science Persistent Demonstrator (OSPD) aims to promote collaborative open science and enable responsible innovation linked to Earth Observation (EO) by making it simple to connect data and platforms together in transparent, reusable and reproducible workflows.
The OSPD’s design enables reproducible open science by creating practical mechanisms to put the principles of reusability, portability, and transparency into action.
Reusability, as envisioned by the OSPD, involves consistent utilization of EO data, processes, and scientific workflows across diverse platforms and sectors. By embedding this principle in tools, the OSPD aims to maximize the value derived from each segment of EO data, thereby promoting efficiency.
Portability underscores the need for EO data, applications, and insights to transition seamlessly across various platforms, ensuring broader applicability and flexibility.
Transparency facilitates a clear view into the mechanisms of EO data processing and use. By offering a transparent system, the OSPD endeavors to foster trust and provide clarity for its users, stakeholders, and the broader EO community.
This report provides an overview of the principles and values guiding the development of the OSPD, the initial design of the OSPD, lessons learned from early design and prototyping experiments, and plans for the next phase of OSPD development.
2.2. Scenarios
The OSPD Pilot series is designed foremost as a tool for technical professionals who are developing tools and systems to support scenarios where integrity, provenance, trust, and cross-domain integrations are required. Two example scenarios are provided below to illustrate situations in which a technical user would be developing a tool, model, or system to meet these requirements. The scenarios will be implemented and refined step by step during the current and future phases of the OSPD Pilot.
2.2.1. Scenario 1 — Water quality degradation due to harmful algal blooms
Problem Overview
Harmful algal blooms (HABs) refer to the overgrowth of algal species, which can have harmful effects on human and animal populations. HABs occur naturally and are not always harmful; the degree of harm and damage to humans or animals depends on the type of species involved in the bloom and the type of toxins released (Gobler).
Human activity is contributing to more algal blooms throughout the planet, mainly through the pollution of waterways with nitrogenous and phosphate-rich waste from agricultural and industrial activity, inadequate wastewater treatment, and road runoff (Guo et al.). Warmer weather associated with global climate change is also contributing to the greater frequency of algal blooms worldwide (Gobler et al.).
When harmful, the production of toxins by cyanobacteria in freshwater and brackish water systems and by dinoflagellates and diatoms in marine water systems can lead to dead zones that kill off all flora and fauna (NIH). In humans, these toxins can cause neurological, gastrointestinal, and respiratory effects, as well as damage to skin and tissue. HABs also have economic implications, as fisheries, wildlife and recreational activities, and tourism are impacted.
User Personas
Health Departments: As a state or local health agency, I want continuous monitoring of cyanobacteria levels in bodies of water within my area of jurisdiction because both sampling (expensive, labor-dependent) and community reporting are inconsistent and unsustainable.
Federal Health Authorities: As a federal health authority, I want continuous monitoring of bodies of water across the U.S. so that I can provide this as a service to state health authorities. Government agencies such as NASA, ESA, NOAA, USGS, HHS, the EPA, and others may wish to task satellites to provide continuous monitoring of cyanobacteria levels in bodies of water of particular interest on behalf of state and local health authorities who may not be able to afford to do so.
Health Systems: As a health system/hospital, I want to know about the presence (location, type, extent) of an algal bloom so I can prepare as well as possible to provide care to those affected.
Water Companies: As a water company, I want to know about algal blooms along with any issues that may impact any body of water I may be using as a source for providing drinking water.
Fisheries industry: As a Fishery, I want to know about algal blooms along with any issues that may impact any body of water I may be using as a source of seafood.
The Public: As a member of the public, I want to know how scientists and policy makers understand and make decisions about algal blooms and determine which bodies of water are potentially unsafe, and the impacts in my community.
Open Science Requirements
MUST:
Correctly identify current algal blooms over 90% of the time (90% accuracy)
Consistently identify algal blooms given the same parameters (99% consistency)
Provide transparency in methodology used
Technical details
Scientific rationale
Contain a mechanism for ingesting feedback from expert users
Contain a mechanism for incorporating useful feedback into system updates
SHOULD:
Exhibit enough repeatability and reproducibility to be a trustworthy tool for public health experts
Be timely in its output
Provide output that is useful for intervention
Have a user interface that requires minimal onboarding and training for tasks such as finding and reproducing workflows
Have a platform that is sustainable by non-experts over a long term
Have a structure for updates and review of data in a cyclical and known time frame
COULD:
Contain links and open resources related to the material for further education
Be combined with other monitoring platforms and tools by expert users, taking OSPD workflow outputs into external analytical tools
Contain data that can be downloaded by expert users
Exist as both a desktop version and a mobile app version
WOULD:
Would like to have the platform be visually engaging and user friendly
Would like to have the platform have either a version that is user friendly towards the lay public or sections specifically designed for the lay public
Would like to have the input of the lay public as well
e.g., swimmers’ clubs, habitual beach users, commercial fishing
Would like to advertise our work in the scientific literature and scientific spaces
2.2.2. Scenario 2 — Water quality degradation due to floods and droughts
Problem Overview
Natural disasters like floods and droughts severely impact water sources, leading to significant challenges in water consumption and safety. Floods can contaminate water with pathogens, pesticides, pollutants from factories, and heavy metals by overwhelming treatment facilities and inundating agricultural lands, while droughts reduce water availability, concentrating pollutants and exacerbating competition for scarce resources. These events compromise the quality and availability of drinking water, posing risks of waterborne diseases such as diarrhea, cholera, and hepatitis, alongside long-term health effects from chemical contaminants. The socioeconomic implications of these disasters are profound, amplifying inequalities in water access, particularly affecting marginalized communities. Recovery efforts are often costly and lengthy, exacerbating water scarcity issues. Addressing these challenges necessitates resilient water management strategies, infrastructure improvements, and community-focused initiatives to ensure equitable access to clean and safe water for at-risk groups.
User personas
Emergency Response: As a representative of the emergency response community, I want to know the risks to drinking water as both access to water is important in a disaster (e.g., fire) as well as protecting water from contamination.
Health Departments: As a state or local health agency, I want to know if any water resource is compromised so that we can issue the necessary alerts (e.g., boil advisory) and begin remediation.
Health Systems: As a health system/hospital, I want to know about unsafe drinking water so I can prepare as well as possible to provide care to those who have consumed unsafe water.
Water Companies: As a water company, I want to know about unsafe drinking water so we can issue the necessary alerts (e.g., boil advisory) and begin remediation.
The Public: As a member of the public, I want to know how scientists and policy makers evaluate and make decisions about the status of my drinking water, both to be able to understand how their actions are designed to protect my health, and impacts of this in my community.
Open Science Requirements
MUST:
Correctly identify current contaminated water reservoirs over 90% of the time (90% accuracy)
Consistently identify contaminated water given the same parameters (99% consistency)
Provide transparency in methodology used
Technical details
Scientific rationale
Contain a mechanism for ingesting feedback from expert users
Contain a mechanism for incorporating useful feedback into system updates
SHOULD:
Exhibit enough repeatability and reproducibility to be a trustworthy tool for public health experts.
Be timely in its output
Provide output that is useful for intervention
Have a user interface that requires minimal onboarding and training
Have a platform that is sustainable by non-experts over a long term
Have a structure for updates and review of data in a cyclical and known time frame
COULD:
Contain links and resources related to the material for further education
Be paired with other monitoring platforms and tools by expert users
Contain data that can be downloaded by expert users
Exist as both a desktop version and a mobile app version
WOULD:
Would like to correctly identify current contaminated water reservoirs 100% of the time (100% accuracy)
Would like to have the platform be visually engaging and user friendly
Would like to have the platform have either a version that is user friendly towards the lay public or sections specifically designed for the lay public
Would like to have the input of the lay public as well
Would like to advertise our work in the scientific literature and scientific spaces
2.3. Technical Requirements
Technical users of the OSPD might include the staff of small businesses working on contracts with public agencies, data engineers or analysts in public agencies or non-profits working in the public interest, or university researchers engaged in use-inspired or applied projects. For such technical users to implement tools that can address scenarios like those above, the OSPD needs to address the following requirements.
I need to discover algorithms that can be used, alone or in combination with other algorithms, to address my problem.
I need to discover scientific literature that provides evidence that the algorithms are validated (peer reviewed) and robust.
I need to discover platforms where these algorithms can be run using the data I need to address my problem.
I need practical instructions on how I can reuse the algorithms I found to address my problems.
I need to be able to discover and use specific implementations (versions) of the selected algorithms.
I need to design and build a workflow that uses a combination of multiple algorithms to address my problem.
I need to reuse algorithms on their respective platforms and to be able to change parameters in them.
I need to combine several algorithms on the same platform into a workflow by using the output of one algorithm as the input to another.
I need to combine several algorithms on different platforms into a workflow by using the output of one algorithm as the input to another.
I need to take the results of a workflow and use them in subsequent analyses.
I need to share my workflow and analysis so that someone else can reuse them with confidence.
I need the workflows I build to be portable to other environments, including my own organization’s infrastructure.
I need to run an experiment on a different platform where I have the necessary credits and credentials to execute a workflow.
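Several of the requirements above concern combining algorithms into workflows by using the output of one algorithm as the input to another, including across platforms. Under OGC API — Processes, one way to do this is to pass the previous step's result by reference rather than by value. The sketch below illustrates that pattern; the process input name and result URL are illustrative, not taken from any OSPD platform.

```python
# Hypothetical sketch: chaining two processing steps by reference, so that
# the output of one OGC API - Processes execution becomes the input of the
# next. The input name "ndvi_raster" and the result URL are made up.

def execution_request(inputs: dict) -> dict:
    """Build a minimal OGC API - Processes execution request body."""
    return {"inputs": inputs}

def chain_by_reference(first_result: dict, input_name: str) -> dict:
    """Use the href of a previous result as a by-reference input for the
    next process, so large data can stay on the provider's platform."""
    href = first_result["href"]
    return execution_request({input_name: {"href": href}})

# Example: a (hypothetical) NDVI result feeds a downstream process.
ndvi_result = {"href": "https://platform-a.example/results/ndvi-42.tif"}
request_body = chain_by_reference(ndvi_result, "ndvi_raster")
```

Passing a link object (`{"href": ...}`) instead of embedded data is what keeps cross-platform workflows practical when the intermediate products are large.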
2.4. Objectives
To achieve its aims, the OSPD is developing a distributed cyberinfrastructure with four key components. OGC standards (1) (OGC API-Processes and -Features) are implemented to enable diverse platforms (2) to make their geospatial data, processing and analytical services available on a workflow building and execution platform, Galaxy (3), so that researchers can easily reuse and remix them in their own workflows and port them to new platforms. Templates are implemented in osf.io (4), a domain-agnostic trusted digital repository, to guide providers of data and analytical services and creators of Galaxy workflows to document their services, workflows, and outputs, so that they are more findable and transparent. The OSPD is creating linked learning and outreach materials that make participating platforms visible and accessible to a wide range of users. Use cases demonstrate how the OSPD supports its users to compose and document open science workflows leveraging diverse EO platforms.
Figure 1 — The Open Science Persistent Demonstrator as a place to discover, prototype, share, and archive geospatial services and workflows
The currently implemented and future planned features of the OSPD support essential open science actions, including the following.
Assigning DOIs or UUIDs: Persistent identifiers such as Digital Object Identifiers (DOIs) ensure that entities are persistently accessible and citable, embedding them within the broader discourse of linking data, code, and documentation and enabling cross-indexing.
Writing Documentation and Intended Application Good Practices: Describing data and application workflows using well-documented practices, metadata, and templates helps anyone, regardless of prior knowledge, engage with a dataset or workflow, and aids those looking to extend or adapt work for new contexts.
Implementing Version Control Systems or Containerization: These technologies enable tracking and recall of versions of algorithms, parameters, and configurations.
Implementing Standardized Data Formats and APIs: These formats and tools ensure platforms interact seamlessly, promoting effective interoperability.
Transferring licenses and metadata: Collecting all metadata and all data and code licenses used, and passing them between platforms following a consistent structure, enables reuse.
Cross-Platform Portability and Testing: Regular testing across platforms ensures that inconsistencies are identified and addressed and results are consistent.
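The "transferring licenses and metadata" action above can be made concrete with a small sketch: each workflow step carries forward the licenses and provenance of its inputs in a consistent structure. The field names (`licenses`, `derived_from`) and identifiers here are illustrative assumptions, not an OSPD-defined schema.

```python
# Hypothetical sketch of license and metadata transfer between workflow
# steps. Field names and identifiers are illustrative only.

def propagate_metadata(step_id: str, inputs_meta: list) -> dict:
    """Build output metadata that carries forward every upstream license
    and records which inputs the step was derived from."""
    licenses = []
    derived_from = []
    for meta in inputs_meta:
        for lic in meta.get("licenses", []):
            if lic not in licenses:      # keep each license exactly once
                licenses.append(lic)
        derived_from.append(meta.get("id"))
    return {"id": step_id, "licenses": licenses, "derived_from": derived_from}

# Example: an output derived from a dataset and a model keeps both licenses.
sentinel = {"id": "s2-l2a", "licenses": ["CC-BY-4.0"]}
model = {"id": "hab-model-v1", "licenses": ["Apache-2.0"]}
out = propagate_metadata("hab-map-2024-07", [sentinel, model])
```

A consistent structure like this is what lets a downstream platform check, mechanically, whether a derived product can be reused and under which terms.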
2.5. User Journeys
Based on the use cases above, two key user groups were identified who would engage with the OSPD. The first user group is platform providers, represented below by the user journey of a developer named Grace. Scientists and researchers comprise the second user group, and are represented by Lila’s user journey. The training materials developed for this project are designed to support each group of users on their journey through the OSPD.
The following user journey describes Grace, a developer at CODEXYZ Ltd, who wants to create an open science workflow within the OSPD demonstrator ecosystem for scientists and researchers to use freely. CODEXYZ has built their own in-house processing platform, and uses this to provide commercial products and services to their clients. The algorithms, data, and processing methodologies are considered CODEXYZ’s intellectual property, but Grace is able to use the OSPD to create a freely accessible workflow that is made interoperable through implementation of OGC API — Processes.
Figure 2 — Platform Provider User Journey
The next user journey describes Lila, a scientist who wants to conduct research focused on the health impacts of climate change. Lila doesn’t have any programming experience and has limited access to geospatial data processing infrastructure. Despite these limitations, she is able to leverage the OSPD to identify a research project, find and analyze relevant geospatial datasets, and write a successful application for research funding.
Figure 3 — Scientist User Journey
3. Technical Features
3.1. Discoverability and archiving
The OSPD builds on the discoverability and archiving capabilities of the OSF.io platform, which are complemented by the workflow building capabilities of Galaxy. The OSF.io platform supports discoverability both by encouraging the creation of robust metadata through registration templates, which prompt contributors to add tags, and through its faceted search. OSF.io is a Trusted Digital Repository. An OSF Registry provides a permanent, transparent, easily accessible repository that enables the archiving, sharing, searching, and aggregating of funded study plans, designs, data, and outcomes. Researchers can create robust, timestamped registrations of research projects.
Metadata plays a crucial role in enhancing the discoverability of these materials both on and off the platform. By providing detailed metadata during the registration process, researchers improve the visibility of their work to other scholars, institutions, and the broader academic community. Rich metadata, including title, contributors or authors, keywords, subjects, and licensing information, allows for more accurate and efficient searching and browsing, enabling researchers to locate relevant materials more effectively. After the registration is archived, researchers have the flexibility to append additional metadata such as the type of information included in the registration, the language in which it is written, details about the funding agency, award title, award number, and URI. Moreover, researchers can seamlessly link other pertinent research materials, whether they reside on or off the OSF.io platform, to the registration via a DOI.
Figure 4 — Registrations are the basic archival entity in the OSF system. The OSPD will develop customized templates to capture consistent metadata about platforms, services, workflows and workflow instances.
Metadata serves as the backbone of discoverability within OSF’s search page. By leveraging the rich metadata associated with research materials, OSF’s search page empowers researchers to pinpoint relevant information efficiently. Through the creation of filters based on the metadata mentioned earlier, users can fine-tune their search queries to align with their specific research interests and objectives. This granular level of filtering not only narrows down search results but also ensures that researchers are presented with highly relevant and contextually appropriate materials.
Figure 5 — The OSF supports the discovery of registrations and other entities based on their metadata both through free text and faceted search.
The OSF is actively restructuring its integrations platform to allow for a more robust set of integrations led by research communities. The CEDAR (Center for Expanded Data Annotation and Retrieval) Workbench is one of several upcoming integrations that expand metadata further to include key information that is important for different research communities. The restructuring will also make other integrations, such as the one with Galaxy, easier to build.
Figure 6 — The design of metadata templates in the OSF for the OSPD will be based on input from both platform providers and scientist end users involved in the development of the OSPD.
Metadata also serves as a bridge between the OSF.io platform and other scholarly resources by making information accessible through Application Programming Interfaces (APIs). APIs allow external platforms and services to interact with OSF.io data, enabling seamless integration and interoperability with a wide range of research tools and systems. By exposing metadata through APIs, OSF.io promotes data sharing and facilitates the exchange of research information across diverse platforms, ultimately enhancing collaboration and accelerating scientific progress.
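As one illustration of metadata exposed through such APIs, the sketch below extracts discovery-relevant fields from a JSON:API-style response of the kind the OSF v2 API serves. The payload is hand-written to show the general shape; it is not a real OSPD record, and production code would fetch it over HTTP rather than from a string.

```python
# Hedged sketch: summarizing registration metadata from a JSON:API-style
# document such as those returned by the OSF v2 API. The sample payload
# below is illustrative, not a real record.
import json

def summarize_registrations(payload: dict) -> list:
    """Extract the metadata fields most useful for discovery."""
    out = []
    for item in payload.get("data", []):
        attrs = item.get("attributes", {})
        out.append({
            "id": item.get("id"),
            "title": attrs.get("title"),
            "tags": attrs.get("tags", []),
        })
    return out

sample = json.loads("""{"data": [{"id": "abc12", "type": "registrations",
  "attributes": {"title": "HAB workflow", "tags": ["ospd", "eo"]}}]}""")
rows = summarize_registrations(sample)
# rows -> [{"id": "abc12", "title": "HAB workflow", "tags": ["ospd", "eo"]}]
```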
OSPD users will be expected to register four types of entities in the osf.io system: platforms, services, workflows and workflow instances. Platforms are independent providers of geospatial data, search, processing, analytical or modeling services who can connect with the OSPD through use of OGC Standards conformant APIs and formats. Services are the specific data, search, processing, analytical or modeling capabilities provided by platforms to the OSPD using OGC Standards conformant APIs and formats. Workflows are the templates for operations that use one or more than one component (where a component is a service or data) to accomplish a task such as the execution of a multi-step scientific analysis or modeling of a scenario. Workflow instances are executed workflows where specific data, services and parameters have been used to generate an output.
Platform providers who are contributing services (data, search, analysis) to the OSPD will be expected to document their platform as an entity, enabling OSPD users to discover multiple services provided on a single platform and see the set of platforms in the OSPD ecosystem. Platform providers will also be expected to document the specific services they provide, which will be linked from both platforms and workflows. Scientist users of the OSPD will be expected to document workflows they build on Galaxy to enable their reuse by others and their discovery. Scientist users will also be able to document and archive workflow instances — runs of a specific workflow using specific platforms and data — on osf.io, enabling transparency. The templates specifying the required metadata and documentation for each entity are being developed during the current phase of the OSPD Pilot.
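The four entity types and their links can be sketched as a simple data model. The fields below are assumptions drawn from the description above, not the metadata templates under development in the current phase, and all identifiers are placeholders.

```python
# Illustrative data model for the four OSPD entity types. Field names and
# identifiers are assumptions, not the OSPD metadata templates.
from dataclasses import dataclass, field

@dataclass
class Platform:
    name: str
    services: list = field(default_factory=list)  # ids of provided services

@dataclass
class Service:
    id: str
    platform: str   # the platform that provides this service
    api: str        # e.g. "OGC API - Processes"

@dataclass
class Workflow:
    id: str
    components: list  # service (or data) ids used by the template

@dataclass
class WorkflowInstance:
    id: str
    workflow: str     # the workflow template that was executed
    parameters: dict  # the specific parameters used in this run
    output: str       # reference to the archived output

# Example linkage: platform -> service -> workflow -> archived run.
svc = Service("flood-map", platform="terradue", api="OGC API - Processes")
wf = Workflow("flood-analysis", components=[svc.id])
run = WorkflowInstance("run-001", workflow=wf.id,
                       parameters={"aoi": "example-bbox"},
                       output="registration-placeholder")
```

The point of the model is the direction of the links: workflow instances reference workflows, workflows reference services, and services reference platforms, which is what lets a reader of an archived run trace its full provenance.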
3.2. Workflow building, testing and execution features
To meet the use cases’ requirements, the OSPD selected Galaxy. Galaxy provides a toolset for building, testing, and executing workflows that combine data and analytical tools. These tools may be provided by any platform with services conformant to OGC API Processes. The Galaxy platform is a core component of the OSPD architecture, acting as a hub for running and orchestrating workflows composed of processes hosted on one or several independent platforms.
3.2.1. Key Features
Galaxy is a flexible and extensible open source web application to assist researchers in publishing and reusing reproducible workflows. The platform offers users a large number of tools to accomplish specific tasks, such as data manipulation (e.g., adding a column to an existing dataset), data analysis (e.g., running a statistical computation), or data visualization (e.g., creating figures). These tools can be combined into readily sharable, well-documented workflows. Galaxy is widely used in the research community, with over 50,000 users from over 100 countries as of 2023. Because Galaxy tools and OGC API Processes both use an input-processing-output logic for building workflows, integration of OGC API Processes into the Galaxy platform can enable the combined use of standards-conformant geospatial data and analytical tools and leverage those contributed by other researchers using Galaxy.
Figure 7 — Multiple services on different platforms using OGC API Processes are integrated into a single workflow on Galaxy, as shown in the high-level schematic.
The challenge of the OSPD is to explore how to develop the necessary integrations to maximize usability and value both for platforms providing data, analytical, and modeling tools and for researchers using Galaxy to build and execute workflows for open science applications.
Through discussions with the Galaxy community, Galaxy’s core developers, and the partners involved in the OSPD project, the following key requirements for integrating OGC API Processes were identified.
The services provided by partner platforms should continue to run on the providers’ own infrastructure.
Since geodata can be large, data transfer to and from Galaxy should be avoided.
Users should be able to configure the input parameters via the Galaxy user interface.
Users should be able to connect services in Galaxy to create workflows.
Figure 8 — Results are passed as files containing URLs to enable data to remain at rest and keep processing on partner platforms’ compute infrastructures.
Based on these requirements, several integration options were outlined. The Wrapper option described below stands out for its ability to meet the requirements and was preferred by the OSPD platform providers, a key stakeholder community. The strengths, weaknesses, opportunities, and threats of the Wrapper option are explored in detail below, together with a brief discussion of the alternative options.
3.2.2. Integration options
3.2.2.1. The Wrapper option
The idea of the Wrapper option is to wrap an OGC API Process in a Galaxy tool and enable the tool-service communication flow described in Figure 9. The tool (Galaxy service wrapper) runs on the Galaxy platform and collects all required input parameters from the user. The tool then passes the parameters to the process and executes it via its API. Using the job ID received from the process, the tool requests the job status at regular intervals and fetches the result once the process finishes successfully. An example in-development wrapper integration is available on GitHub in a Galaxy fork. We anticipate that most platforms will pursue this option to integrate with Galaxy.
Figure 9 — Communication flow between the wrapper and the OGC API Process.
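The submit–poll–fetch flow above can be sketched in a few lines. This is an illustrative sketch, not the actual wrapper code: the function and class names are hypothetical, the HTTP client is injected as plain callables so the flow can be exercised without a live server, and only the endpoint paths (`/processes/{id}/execution`, `/jobs/{jobId}`, `/jobs/{jobId}/results`) follow OGC API — Processes Part 1: Core.

```python
# Sketch of the Wrapper option's tool-service communication flow.
# `post` and `get` stand in for an HTTP client (e.g. requests.post/get);
# all names here are illustrative, not part of the OSPD codebase.
import time

def execute_and_poll(post, get, base, process_id, inputs, interval=0.01):
    """Submit an async execution request, poll the job, return result links."""
    # 1. Submit the job to the remote platform's OGC API - Processes endpoint.
    job = post(f"{base}/processes/{process_id}/execution", json={"inputs": inputs})
    job_id = job["jobID"]
    # 2. Poll the job status at regular intervals until it finishes.
    while True:
        status = get(f"{base}/jobs/{job_id}")["status"]
        if status in ("successful", "failed", "dismissed"):
            break
        time.sleep(interval)
    # 3. Fetch the result only if the process finished successfully.
    if status != "successful":
        raise RuntimeError(f"job {job_id} ended with status {status}")
    return get(f"{base}/jobs/{job_id}/results")

class FakeServer:
    """Minimal in-memory stand-in for a conformant server, for illustration."""
    def __init__(self):
        self.calls = 0
    def post(self, url, json=None):
        return {"jobID": "j1", "status": "accepted"}
    def get(self, url):
        if url.endswith("/results"):
            return {"result": {"href": "https://example.org/out.tif"}}
        self.calls += 1
        return {"status": "running" if self.calls < 3 else "successful"}

srv = FakeServer()
result = execute_and_poll(srv.post, srv.get, "https://example.org/ogcapi",
                          "flood-map", {"aoi": "..."})
print(result["result"]["href"])  # → https://example.org/out.tif
```

Injecting the client callables keeps the orchestration logic separate from transport details, which is also what lets the wrapper leave the remote service untouched.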
3.2.2.2. Further integration options
Re-implementation: Another option to integrate an OGC API Process in Galaxy as a tool is to re-implement it. The advantage of this option is that it does not require a remote server, which is relevant if a server is unavailable or could be shut down, or if maintaining an independent external platform is beyond the capacity or needs of the organization that created the tool. This option can also be used to check the robustness of the tool integration process. Its disadvantages are that re-implementing a process in Galaxy is time-consuming, that it does not help service providers who want to see their own infrastructure in use, and that transferring larger datasets to Galaxy might be necessary. We anticipate that this option would be pursued only by organizations not in a position to maintain their own platform and that, in practice, it will rarely be used for the OSPD.
Containerization: The final option is to integrate the Application Package directly in Galaxy. The Application Package includes the service’s containerized application and all necessary service metadata. This may provide a useful and easy-to-maintain option where the service is already containerized, but otherwise it requires additional effort to create a Dockerfile. We anticipate that some platforms may choose to pursue this option in future phases of the OSPD.
3.2.3. SWOT Analysis
Strengths: The Wrapper option offers a number of benefits. First, the service is executed from Galaxy, but the computations run on the provider’s infrastructure. This is particularly important if the OSPD infrastructure is based on a shared Galaxy instance (e.g., the European Galaxy server) whose resources are limited because they are managed centrally and allocated across multiple projects and users. The providers have more control over the server hosting the process and can add further resources if needed. Second, to avoid heavy data transfer (e.g., in the case of big data), the example implementation of the Wrapper option stores the URL to the result in a .txt file, which can be used as an input to another wrapper in a workflow. Since only a list of URLs linking to the input datasets needs to be sent from one process to the next, a .txt file is a suitable data format. Finally, the wrapper is a lightweight implementation, and the service itself remains untouched.
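The URL hand-off convention described above is simple enough to sketch directly. The helper names and file name below are illustrative (the actual wrapper implementation may differ); the point is that only a small text file of links, not the data itself, moves between workflow steps.

```python
# Sketch of the .txt hand-off between wrappers: an upstream step writes
# result URLs to a text file, and a downstream step reads that file as
# its input, so the datasets themselves stay at rest on the platforms.
from pathlib import Path

def write_result_urls(urls, path):
    """Persist result URLs produced by one process in the workflow."""
    Path(path).write_text("\n".join(urls) + "\n")

def read_input_urls(path):
    """Recover the URL list for the next process in the workflow."""
    return [line for line in Path(path).read_text().splitlines() if line]

write_result_urls(["https://example.org/scene1.tif",
                   "https://example.org/scene2.tif"], "step1_outputs.txt")
print(read_input_urls("step1_outputs.txt"))
```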
Weaknesses: A limitation of the Wrapper option is its scalability to several hundred or even thousands of services. While it is possible to create a template that simply needs to be completed with the corresponding process information, resulting in one Galaxy tool per process, too much manual work would still be required. Notably, beyond completing the template, every tool (i.e., the wrapper of one service) would need to be onboarded to Galaxy, which requires a set of pull requests and human review by Galaxy developers that cannot be automated. To mitigate this issue, we implemented a generalized solution, the OGCProcess2Galaxy tool, which wraps a set of processes into one Galaxy tool. For each server, the tool fetches all processes via GetCapabilities from the OGC API Processes provider (see Figure 10). If not all processes are needed, a configuration file can indicate which processes should be included or excluded.
Then, for each process, all necessary information, including metadata, inputs, and outputs, can be requested via ProcessDescription to build the Galaxy tool. First tests with the servers provided in the OSPD project showed that the tool can generate a template, but that additional manual work remains for the variable and parameter attributes and for the communication between Galaxy and the workflow. This is because, although the processes are standardized, they can be implemented in different ways and may have aspects (edge cases) that are difficult to handle generically. The final tool will be reviewed by the Galaxy community for security and operational checks and, if the review is successful, made publicly available. Updates to the list of servers would require an update of the Galaxy tool; however, such minor updates do not require the same review process as a new tool.
Figure 10 — Description of how to create a generic wrapper.
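The per-process generation step can be sketched as follows. This is an illustrative sketch, not the actual OGCProcess2Galaxy code: the function name and the mapping of every input to a plain text parameter are simplifying assumptions. The input dictionary mimics a process description as returned by an OGC API — Processes server, and the output follows the general shape of a Galaxy tool XML file.

```python
# Illustrative sketch: turning a process description into a minimal
# Galaxy tool XML stub. Real generation must also handle typed inputs,
# outputs, and the command section, which is where the manual work
# described above comes in.
from xml.sax.saxutils import escape

def process_to_tool_xml(desc):
    """Map each process input to a Galaxy <param> and emit a tool stub."""
    params = "\n".join(
        f'    <param name="{escape(name)}" type="text" label="{escape(spec.get("title", name))}"/>'
        for name, spec in desc.get("inputs", {}).items()
    )
    return (
        f'<tool id="{escape(desc["id"])}" name="{escape(desc.get("title", desc["id"]))}" version="0.1">\n'
        f'  <description>{escape(desc.get("description", ""))}</description>\n'
        f'  <inputs>\n{params}\n  </inputs>\n'
        f'</tool>'
    )

# Example process description (hypothetical process id and inputs).
desc = {
    "id": "flood-map",
    "title": "Flood mapping",
    "description": "On-demand flood extent from SAR scenes",
    "inputs": {"aoi": {"title": "Area of interest"},
               "date": {"title": "Acquisition date"}},
}
print(process_to_tool_xml(desc))
```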
Opportunities: The Galaxy platform originated in the life sciences community but the Geo user community on Galaxy is growing. The integration of OGC API Processes in Galaxy can increase the awareness of OGC APIs and standards within and beyond the Geo domain. A successful integration can demonstrate the interoperability of OGC API Processes across platforms and, critically, across scientific domains. Moreover, users who are unfamiliar with OGC API Processes or do not have the skills to work with the API receive a ready-to-use application as an entry point to OGC API Processes.
Threats: One risk is the frequently occurring issue of maintenance and the question of who is responsible for maintaining the different elements of the OSPD. Updates to the OGC API can break the Galaxy tool and make it unusable until it receives an update. However, this issue is not specific to the OGC API — Galaxy integration but a general problem in software development. Furthermore, it is debatable whether it is better to have a “stable” tool that keeps running after changes in the service but may silently produce irreproducible results, or a tool that breaks and, hence, cannot deliver irreproducible results.
The generalized approach has some risks. Notably, it may prove difficult to spot incorrectly implemented edge cases in an operational tool. While creating tests in a Galaxy tool, similar to unit tests, might catch common errors, some issues in a given process will only become visible in use. Another potential risk comes from the quality of the descriptions of the OGC API Processes: a generic tool relies strongly on high-quality process descriptions, which must be created and maintained by platform and service providers.
3.2.4. Open Design questions
How to apply the Wrapper option to OGC API Features?
While further research is needed to investigate how OGC API Features can be integrated in Galaxy, it is broadly expected that the challenges of integration of further OGC APIs will reflect those encountered in integrating OGC API Processes. Testing of an integration workflow for OGC API Features will commence in the second part of the current phase of the OSPD.
How to inform users about service updates?
Updates to the services may or may not break the Wrapper tool. Regardless, to meet the OSPD’s reproducibility requirement, users need to be made aware of any updates to the services, and these changes must be reflected in the services’ OSF.io documentation. Options for displaying and transmitting this information require further exploration.
3.3. Platforms Overview
The third part of the OSPD infrastructure is the community of platforms contributing OGC standards conformant services to the OSPD. Details on each platform’s contributions and work are summarized in the OSPD Community Platforms — Detail section.
Platforms play a key role in the OSPD by providing the data, data cataloging, processing, analysis, and modeling services as components that can be incorporated into workflows on Galaxy. They remain independent of Galaxy and provide compute infrastructure and storage, enabling the OSPD to adhere to the ‘data at rest’ principle and minimizing the computational costs of large data transactions. Platforms are responsible for maintaining or hosting services conformant to OGC standards and for meeting any additional requirements for integration into the OSPD, notably minimal standards for metadata and registration of services and platforms in the OSPD OSF registry. The OSPD is an open infrastructure, and further platforms will be able to join and add services that are conformant to the OGC API standards supporting the system.
Descriptions of the developments by individual platforms participating in the current phase of the OSPD are included in an annex.
3.4. Design Considerations
Some design or implementation choices have implications across multiple components and require coordination beyond the conformance to OGC standards that otherwise enables interoperability across the OSPD.
Platform Capabilities and Application Reproducibility
Authentication
openEO
3.4.1. Platform Capabilities and Application Reproducibility
As seen above, platforms serve as comprehensive environments offering interfaces and tools for processing and utilizing EO data. These platforms enable developers to not only test and execute their applications but also to deploy and share them with others for individual or integrated workflow usage.
At their core, these platforms facilitate the deployment and execution of Application Packages formatted in the Common Workflow Language (CWL) as defined in OGC 20-089, each defined by unique parameters and process descriptions. The use of Application Packages ensures the portability and reproducibility of workflows across different execution platforms, as each package encapsulates the entire workflow environment, including all necessary software components and dependencies. When integrated with cloud computing resources, these platforms efficiently manage and execute user-requested data processing tasks and ultimately return the processed information.
With the introduction of Reproducible FAIR Workflows, there is a need to adapt these platforms to fully address the reproducibility of the application requests and their parameters with retrospective provenance. This includes detailed documentation of each process executed and the environment it operated in. CWLProv emerges as a solution to represent workflow-based computational analysis and its provenance at various levels. CWLProv, alongside the structured provenance from the W3C PROV Model, facilitates the creation of a workflow-centric Research Object (RO) that aggregates and shares resources. This object encompasses all aspects of a workflow from its initiation to final outputs. Its structure adheres to the BagIt format, which is a set of hierarchical file layout conventions that ensure reliable storage and transfer of digital content.
The BagIt-compliant CWLProv format includes not only the data generated by the workflow but also metadata detailing the workflow’s creation. This allows for the verification and replication of results, fostering an environment where applications are not only deployable but also transparent and reproducible.
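The BagIt layout underpinning the CWLProv Research Object is simple enough to illustrate. The sketch below builds a minimal bag per the BagIt specification (RFC 8493): a bag declaration file, a `data/` payload directory, and a checksum manifest. The function name, directory name, and payload content are illustrative; a real CWLProv bag additionally carries the provenance metadata described above.

```python
# Minimal sketch of the BagIt file layout: bagit.txt (declaration),
# data/ (payload), and manifest-sha256.txt (fixity for each payload file).
import hashlib
from pathlib import Path

def make_bag(root, payload):
    """Create a minimal BagIt bag from a {filename: content} mapping."""
    root = Path(root)
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    lines = []
    for name, content in payload.items():
        (root / "data" / name).write_text(content)
        digest = hashlib.sha256(content.encode()).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (root / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")

make_bag("demo_bag", {"outputs.txt": "https://example.org/result.tif\n"})
print(sorted(p.name for p in Path("demo_bag").iterdir()))
# → ['bagit.txt', 'data', 'manifest-sha256.txt']
```

The manifest is what allows a recipient to verify that every payload file arrived intact, which is the property that makes the format suitable for reliable transfer of workflow outputs.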
3.4.2. Authentication
Many platforms restrict access to their resources to eligible users or have use constraints. Hence, authentication is an issue not only when using a single platform but particularly whenever multiple independent platforms interact with each other. Users would need to authenticate at each platform separately before they can run a workflow, which is a cumbersome process and in some cases difficult to manage, for example if authentication information (e.g., cookies) expires after a few minutes. Furthermore, passing such information via tools like Galaxy can introduce security issues.
The optimal approach to coordinating authentication requires further investigation and alignment between the platforms. OpenID Connect (based on OAuth 2) emerges as the option supported by most platforms, but it is relatively complex to implement for platforms that do not support it yet. Although Galaxy also provides OpenID Connect support (https://galaxyproject.org/authnz/config/oidc/), it is not yet clear how it can be used to authenticate at remote platforms.
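The per-platform credential problem can be made concrete with a small sketch. The class and method names below are illustrative, not part of any OSPD component; the sketch only assumes that each platform ultimately expects a bearer token (e.g., one obtained via OpenID Connect) in the `Authorization` header, so an orchestrator must track one short-lived credential per platform.

```python
# Hedged sketch of per-platform credential tracking for a workflow
# orchestrator; all names are illustrative.
import time

class TokenStore:
    """Keep one (token, expiry) pair per platform and flag expired ones."""
    def __init__(self):
        self._tokens = {}

    def set(self, platform, token, lifetime_s):
        self._tokens[platform] = (token, time.time() + lifetime_s)

    def header(self, platform):
        token, expiry = self._tokens[platform]
        if time.time() >= expiry:
            # Short-lived credentials are what makes cross-platform
            # workflows cumbersome: the user must re-authenticate mid-run.
            raise RuntimeError(f"token for {platform} expired; re-authenticate")
        return {"Authorization": f"Bearer {token}"}

store = TokenStore()
store.set("platform-a", "abc123", lifetime_s=300)
print(store.header("platform-a")["Authorization"])  # → Bearer abc123
```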
3.4.3. openEO
The first major goal of the OSPD is to enable users to interact with platforms implementing OGC API — Processes via Galaxy, which acts as a workflow builder and orchestrator. The OSPD is meant to be modular, a design principle already expressed by its support for multiple platforms. However, the initial design of the OSPD includes only one communication protocol (OGC API — Processes) and one workflow builder/orchestrator (Galaxy). To create a path toward alternatives for these components, we are exploring the use of openEO in the first year of the OSPD project. This alternative path defines the openEO API and openEO processes as the communication protocol, both of which are part of an OGC Community Standard candidate. Similarly to the OGC API — Processes integration into Galaxy, the openEO API could also be integrated into Galaxy in future years. To provide an alternative module to Galaxy for the user interface/aggregator, the openEO Web Editor is being evaluated.
openEO is a project aimed at developing an open, standardized framework to connect various clients to big Earth observation cloud backends in a simple and unified way. By providing the standardized openEO API for workflows with pre-defined processes, openEO allows users to perform operations on Earth observation data across different cloud backends with minor changes to their code and algorithms and without needing to worry about the underlying complexity or specificities of each cloud provider. The openEO API aligns with the OGC APIs Standard baseline (especially Common) and STAC. openEO has implementations for various cloud backends, for example Copernicus Data Space Ecosystem (CDSE), openEO Platform, Sentinel Hub, VITO, and Google Earth Engine.
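The "pre-defined processes" model can be illustrated with an openEO process graph, the JSON structure the openEO API uses to describe a workflow. The node names, extents, and collection id below are examples only; the structural elements (`process_id`, `arguments`, `from_node`, `result`) follow the openEO API's process graph format.

```python
# Illustrative openEO process graph: load a collection, compute NDVI,
# save the result. Nodes reference each other via "from_node", and the
# final node is flagged with "result": True.
graph = {
    "process_graph": {
        "load1": {
            "process_id": "load_collection",
            "arguments": {
                "id": "SENTINEL2_L2A",  # example collection id
                "spatial_extent": {"west": 5.0, "south": 51.0,
                                   "east": 5.1, "north": 51.1},
                "temporal_extent": ["2024-06-01", "2024-06-30"],
                "bands": ["B04", "B08"],
            },
        },
        "ndvi1": {
            "process_id": "ndvi",
            "arguments": {"data": {"from_node": "load1"}},
        },
        "save1": {
            "process_id": "save_result",
            "arguments": {"data": {"from_node": "ndvi1"}, "format": "GTiff"},
            "result": True,
        },
    }
}

print(list(graph["process_graph"]))  # → ['load1', 'ndvi1', 'save1']
```

Because the graph names processes rather than backend-specific operations, the same JSON could in principle be submitted to any backend implementing the openEO API, which is what allows users to switch providers with minor changes to their code.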
The openEO Web Editor is a browser-based graphical interface to connect to services that implement the openEO API. It allows a user to discover the offerings of the service, to create openEO workflows through a no-code environment, to manage the user-specific offerings of the service, and to visualize processing results (for D130). While it originates from the openEO community, it recently added basic support for OGC API — Processes.