Published

OGC Engineering Report

OGC Testbed 19 High Performance Geospatial Computing Engineering Report
Eugene Yu (Editor), Liping Di (Editor)

Document number: 23-044
Document type: OGC Engineering Report
Document subtype:
Document stage: Published
Document language: English

License Agreement

Use of this document is subject to the license agreement at https://www.ogc.org/license



I.  Executive Summary

Large-scale geospatial analytical computation is critically needed for tackling a wide range of sustainability problems, such as climate change, disaster management, and food and water security. However, such computation often requires high-performance computing (HPC) resources that are not easily accessible or usable by geospatial researchers and practitioners from various domains. To address this challenge, there is a need for developing and standardizing tools and interfaces that can bridge the gap between user frontends and HPC backends and enable effective and efficient use of High-Performance Geospatial Computing (HPGC) resources for geospatial analytics.

This OGC Testbed 19 Engineering Report (ER) presents the results of a testbed task that reviewed previous and ongoing efforts in applying HPC to geospatial analytics, developed a draft HPGC API based on the OGC API — Processes Standard, implemented a Python client library, and demonstrated typical use cases in Jupyter Notebooks.

This ER provides an overview of the Testbed 19 motivation, objectives, scope, and methodology, as well as a summary of the main findings, recommendations, and future work directions.

CyberGIS-Compute is reviewed and used as a reference to develop the HPGC API. “CyberGIS-Compute is an open-sourced geospatial middleware framework that provides integrated access to high-performance computing (HPC) resources through a Python-based SDK and core middleware services.”[3] The OGC API — Processes[14] is adopted as the base API for standardizing and developing the HPGC API. A Python client library is developed to demonstrate the process of client generation by leveraging the OpenAPI client stub/model automatic generation capability[12]. Typical use cases and scenarios are demonstrated and scripted in Jupyter Notebooks.

II.  Keywords

The following are keywords to be used by search engines and document catalogues.

OGC, Testbed 19, high performance computing, high performance geospatial computing, application-to-the-cloud, testbed, docker, web service

III.  Contributors

All questions regarding this document should be directed to the editors or contributors.

Name | Organization | Role
Eugene Yu | George Mason University | Editor
Liping Di | George Mason University | Editor
Sina Taghavikish | OGC | Task Lead
Furqan Baig | University of Illinois Urbana-Champaign | Contributor
Gérald Fenoy | GeoLabs | Contributor
Carl Reed | Carl Reed and Associates | Content Reviewer

1.  Introduction

The field of large-scale geospatial analytical computation has become increasingly vital in addressing a diverse range of sustainability challenges, including climate change mitigation, disaster management, and ensuring food and water security. Geospatial researchers and practitioners from various disciplines, such as geography, hydrology, public health, and social sciences, are actively engaged in utilizing geospatial analytics to derive valuable insights.

Advanced cyberinfrastructure and expertise in computer science have empowered large-scale computational problem-solving. However, expecting domain experts in geospatial-related fields to possess extensive technical knowledge to directly interact with high-performance computing (HPC) resources on advanced cyberinfrastructure is not realistic. Optimization of HPC resources for the geospatial community’s specific computational challenges requires a bridge between user frontends and HPC backends in the form of middleware tools.

To address this gap, designing and implementing middleware tools that enable seamless interaction between geospatial domain experts and HPC resources is critical. Such tools should abstract the complexities of HPC systems and provide standardized interfaces for effectively accessing, utilizing, and managing High-Performance Geospatial Computing (HPGC) resources. This undertaking necessitates substantial research and development efforts, along with the generalization and standardization of various aspects related to HPGC resource definitions and processing interfaces.

The objective of this Testbed 19 task is twofold: to evaluate previous and ongoing efforts in applying HPC to geospatial analytics and to develop initial standards for HPGC resource definitions and processing interfaces. Examining existing work in the field makes it possible to identify best practices, challenges, and opportunities for enhancing the utilization of HPC in geospatial domains. The goal is to establish standardized guidelines that will facilitate the seamless integration of HPGC resources into geospatial analytical workflows, ensuring their efficient and effective use across diverse application domains.

Achieving this goal will foster collaboration between geospatial domain experts and HPC specialists, enabling a more streamlined and accessible approach to large-scale geospatial analytics. This ER outlines task findings, recommendations, and initial standards, thereby providing a foundation for future advancements in the field of High-Performance Geospatial Computing.

2.  High Performance Geospatial Computing

This section reviews the current status of High Performance Geospatial Computing (HPGC)[4][7][10].

2.1.  Definition of High-Performance Geospatial Computing

High Performance Computing (HPC) refers to the application of advanced computing technologies and techniques to solve computationally intensive problems that require a large amount of processing power, memory, and storage. HPC often involves the use of clusters or supercomputers made up of thousands of interconnected processors and storage devices.

High Performance Geospatial Computing (HPGC) refers to the use of advanced computing techniques, tools, and systems along with geospatial data to solve complex problems related to geography, environmental science, natural resources, national security, healthcare, and other areas that rely on geospatial data analysis. In short, HPGC refers to the use of high performance computing (HPC) resources to solve complex geospatial problems. HPGC utilizes parallel processing, distributed architectures, cloud computing, and high-speed networking to accelerate data processing, modeling, simulation, visualization, and analysis[4][10][13][15]. This enables users to process, analyze, and interpret geospatial data to gain insights, solve problems, and make informed decisions more efficiently and effectively. HPGC systems are typically used to process and analyze large volumes of geospatial data, such as satellite imagery, aerial photography, and lidar data, or to run large simulation models in spatiotemporal domains.
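
As a simple illustration of the data-parallel style of processing on which HPGC relies, the following minimal Python sketch splits a raster scene into tiles and computes a vegetation index for each tile concurrently with a process pool. The file name, band indices, and tile size are hypothetical and used for illustration only; they are not part of any Testbed 19 deliverable.

# Minimal sketch of data-parallel raster processing (illustration only).
# Assumes a GeoTIFF named "scene.tif" whose red and near-infrared bands are
# bands 3 and 4; the file name and band indices are hypothetical.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import rasterio
from rasterio.windows import Window

TILE = 1024  # tile edge length in pixels

def ndvi_tile(task):
    """Compute NDVI for one tile window and return it with its offsets."""
    path, col, row, width, height = task
    with rasterio.open(path) as src:
        window = Window(col, row, width, height)
        red = src.read(3, window=window).astype("float32")
        nir = src.read(4, window=window).astype("float32")
    ndvi = (nir - red) / np.maximum(nir + red, 1e-6)
    return col, row, ndvi

def tile_tasks(path):
    """Enumerate tile windows covering the whole raster."""
    with rasterio.open(path) as src:
        w, h = src.width, src.height
    for row in range(0, h, TILE):
        for col in range(0, w, TILE):
            yield path, col, row, min(TILE, w - col), min(TILE, h - row)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # one worker per available CPU core
        for col, row, ndvi in pool.map(ndvi_tile, tile_tasks("scene.tif")):
            print(f"tile at ({col},{row}) mean NDVI: {ndvi.mean():.3f}")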

Table 1 compares and summarizes major steps of HPC and HPGC.

Table 1 — Major steps in HPC and HPGC

Step | HPC | HPGC
Problem formulation | Identify the problem that needs to be solved and formulate it in a way that can be transformed into a computationally intensive task, including breaking the problem into smaller, more manageable sub-problems and defining the inputs, outputs, and constraints of the solution. | Identify the problem specifically in geospatial domains, such as mapping and charting (e.g., creation of large area maps), disaster response (e.g., monitoring and emergency response to natural disasters, such as floods, hurricanes, and wildfires), and environmental monitoring.
Algorithm development | Develop an algorithm or set of algorithms that can efficiently solve the problem, which includes designing a suitable computational model, selecting appropriate numerical methods, and optimizing the algorithm for parallel processing using techniques such as load balancing, data partitioning, and communication reduction. | Develop specialized algorithms for solving geospatial problems, including designing a suitable geospatial computational model and geospatial analytics, and optimizing geo-computing with parallelism, such as geospatial data partitioning, spatial indexing, and spatially optimized computing.
Programming | Implement algorithms in code that can be run on an HPC system, which includes using specialized languages and libraries, such as MPI (Message Passing Interface), OpenMP (Open Multi-Processing), CUDA (Compute Unified Device Architecture), or OpenCL (Open Computing Language). | Implement geospatial algorithms, which includes using geospatial libraries, such as the Geospatial Data Abstraction Library (GDAL) and spatial projection libraries, and leveraging specialized clustering frameworks, such as GeoSpark[22][24] (a minimal sketch of MPI-style geospatial data partitioning follows this table).
Testing and debugging | Test and debug the code, which includes running the code on smaller test cases, comparing the results to analytical or experimental benchmarks, and identifying and fixing any errors or inefficiencies. | Test and debug the implemented geospatial algorithms, which includes verifying the algorithms against geospatial theories and geostatistical approaches and verifying the results geospatially against ground truth or other verified data.
Execution and monitoring | Execute the computational task, which includes monitoring the system to detect any abnormal behavior, diagnosing and fixing any issues that arise, and collecting performance data for analysis and optimization. | Execute and monitor the geo-computing task, which includes monitoring progress, diagnosing the geo-computing partitions and reduction, and observing progress with geospatial visuals or maps.
Post-processing and visualization | Post-process and analyze the results to extract meaningful insights, which includes sorting, filtering, and aggregating large amounts of data, as well as visualizing the results using tools such as graphs, charts, or maps. | Post-process and analyze the geospatial results, which includes scaling, spatiotemporal statistics, geostatistics, and visualization as interactive maps.
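
To make the Programming step in Table 1 more concrete, the following minimal mpi4py sketch partitions a list of spatial tiles across MPI ranks and gathers the per-tile results on the root rank. The tile identifiers and the per-tile analysis function are hypothetical placeholders and do not come from CyberGIS-Compute or the Testbed 19 implementations.

# Minimal sketch of MPI-style geospatial data partitioning (illustration only).
# Run with, for example: mpirun -n 4 python partition_tiles.py
from mpi4py import MPI

def analyze_tile(tile_id):
    """Placeholder for a per-tile geospatial computation."""
    return {"tile": tile_id, "result": tile_id * tile_id}

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    tiles = list(range(16))                          # hypothetical tile identifiers
    chunks = [tiles[i::size] for i in range(size)]   # simple round-robin partition
else:
    chunks = None

my_tiles = comm.scatter(chunks, root=0)              # distribute partitions to ranks
my_results = [analyze_tile(t) for t in my_tiles]     # compute locally on each rank
all_results = comm.gather(my_results, root=0)        # collect results on rank 0

if rank == 0:
    merged = [r for part in all_results for r in part]
    print(f"processed {len(merged)} tiles across {size} ranks")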

The main benefits of using HPGC are as follows.

  • Speed: HPGC systems can process large datasets of geospatial data quickly and efficiently which can save time and money, and can also help organizations make better decisions faster.

  • Accuracy: HPGC systems can be used to process geospatial data with greater accuracy than traditional methods, which can be important for applications such as climate modeling and earthquake prediction.

  • Scalability: HPGC systems can be scaled to handle larger and more complex datasets which allows organizations to keep up with the ever-increasing volume and complexity of geospatial data.

The major challenges of using HPGC are as follows.

  • Cost: HPGC systems can be expensive to purchase and maintain.

  • Complexity: HPGC systems can be complex to set up and use.

  • Expertise: Using HPGC systems effectively requires specialized expertise.

2.2.  Key Application Drivers

The following lists some of the key applications driving the utilization of high performance geospatial computing.

  • The increasing volume and complexity of geospatial data: The volume of geospatial data is growing exponentially and the data are becoming increasingly complex, making the data difficult to process and analyze using traditional computing methods.

  • The need for real-time processing: In many cases, it is necessary to process and analyze geospatial data in real time, such as for traffic management, disaster response, and national security.

  • The need for accurate and precise results: In many cases, obtaining accurate and precise results when processing and analyzing geospatial data is necessary, for example in climate modeling, earthquake prediction, and wildfire prediction.

2.3.  HPGC Frameworks

CyberGIS-Compute, a middleware framework that bridges high performance computing and end users, was reviewed. It provides the basis for the development of standard HPGC middleware/services that achieve the same capabilities as CyberGIS-Compute, but in a more standards-based environment that adopts widely accepted specifications.

CyberGIS-Compute supports HPC job management (execution and monitoring) and collaborative workflow orchestration. A more detailed review of the CyberGIS-Compute framework and its relationship to the standards development process is available in the Appendix — HPGC Frameworks.

2.4.  Data-Intensive Geospatial Analytics for HPGC

HPGC can enable efficient processing of large volumes of data and data-intensive geospatial analytics. Examples of data-intensive geospatial analytics that can leverage HPGC include the following.

  • Spatial data clustering: Clustering is a technique to group similar objects together based on their spatial properties. Spatial clustering can be used to identify patterns and trends in large geospatial datasets for applications such as urban planning or regional development (a minimal clustering sketch follows this list).

  • Geospatial machine learning: Machine learning algorithms, such as Random Forest[1], Support Vector Machine[2], and neural networks (including deep learning neural networks), can be applied to geospatial data to enable better prediction, classification, and regression analysis. For example, geospatial machine learning can be used in precision agriculture to predict crop yield, disease outbreaks, or identify soil types.

  • Geospatial image processing: High-resolution satellite imagery can generate large volumes of data. Analyzing such volumes of data is computationally intensive. HPGC can be used to analyze large volumes of geospatial imagery data and extract meaningful information such as land use, land cover changes, urban growth, or natural resources.

  • Geospatial simulation modeling: Simulation modeling involves the creation of mathematical models that emulate complex systems, such as traffic flow, pedestrian movement, water flow, or environmental simulation. Computational complexity often hinders simulation modeling using traditional computing techniques. HPGC can help in overcoming this challenge and help model complex systems affected by location.

  • High-throughput geospatial database management: Managing geospatial data requires specialized tools and techniques. HPGC can be used to optimize geospatial database management systems, from data integration and data warehousing to data indexing and querying. This can empower real-time geospatial analytics across various industries such as city planning, transportation planning, or environmental analysis.

  • Geospatial Optimization: A typical geospatial optimization problem is routing, which uses spatial data such as traffic data, road network data, warehouse locations, and delivery routes to optimize the logistics and distribution of goods and services. Optimizing spatial configurations and resource allocation are computationally intensive. HPGC can be used to address computationally complex geospatial optimizations that require numerous iterations or combinatorial comparisons.
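
As a small illustration of the spatial data clustering item above, the following Python sketch clusters synthetic projected point coordinates with DBSCAN. The coordinates, distance threshold, and minimum cluster size are hypothetical; HPGC would apply the same kind of analysis to much larger datasets, typically with distributed or GPU-accelerated implementations.

# Minimal sketch of spatial point clustering (illustration only).
# Coordinates are assumed to be in a projected CRS (meters); values are synthetic.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two synthetic point clusters plus background noise (easting/northing in meters).
cluster_a = rng.normal(loc=(500_000, 4_100_000), scale=50, size=(200, 2))
cluster_b = rng.normal(loc=(502_000, 4_101_500), scale=80, size=(150, 2))
noise = rng.uniform(low=(498_000, 4_098_000), high=(504_000, 4_104_000), size=(50, 2))
points = np.vstack([cluster_a, cluster_b, noise])

# Group points that have at least 10 neighbors within 150 m; label -1 marks noise.
labels = DBSCAN(eps=150, min_samples=10).fit_predict(points)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters, {np.sum(labels == -1)} noise points")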

2.5.  Standardization of HPGC-based Data-Intensive Geospatial Analytics

HPGC has the potential to enable advanced and efficient geospatial analytics. Standardizing and making HPGC geospatial analytic capabilities accessible to the broader geospatial community can enhance knowledge sharing and enable better decision-making across different domains. The HPGC-based geospatial analytics that can be standardized and made accessible to the broader geospatial community include the following.

  • Geospatial Data Processing Pipelines: Standardizing the construction and execution of geospatial data processing pipelines can facilitate the integration and interoperability of HPGC workflows. Defining common data processing steps, input/output formats, and execution frameworks would enable users to easily exchange and share their geospatial processing pipelines across different HPGC platforms.

  • Large-Scale Spatial Data Analysis: Standardizing the algorithms and methodologies for large-scale spatial data analysis tasks, such as spatial clustering, spatial interpolation, or network analysis, can promote consistency and comparability of results across different HPGC implementations, enabling researchers and practitioners to leverage shared algorithms and approaches, reducing duplication of efforts, and fostering collaboration.

  • Geospatial Machine Learning Models: Standardizing the development and deployment of geospatial machine learning models can enhance reproducibility and interoperability. This includes standardizing the representation and serialization of trained models, input/output data formats, and evaluation metrics. Standardized geospatial machine learning models would allow users to easily share, validate, and integrate models into their HPGC workflows.

  • Geospatial Simulation and Modeling Frameworks: Standardizing the frameworks and interfaces for geospatial simulation and modeling can promote the exchange and integration of different simulation models. By defining standard interfaces, input/output formats, and simulation control mechanisms, researchers and practitioners can more easily collaborate, validate, and reuse geospatial simulation and modeling components.

  • Geospatial Image Analysis Workflows: Standardizing geospatial image analysis workflows can improve the accessibility and reproducibility of image processing and analysis tasks. Defining common data formats, preprocessing steps, feature extraction methods, and quality assessment metrics can simplify the adoption and sharing of HPGC-based geospatial image analysis workflows.

  • Spatial Data Fusion and Integration Techniques: Standardizing the methods and workflows for spatial data fusion and integration can enable the seamless integration of data from multiple sources. This includes standardizing data formats, fusion algorithms, data alignment techniques, and uncertainty modeling approaches. Standardized geospatial data fusion and integration techniques would facilitate data sharing and interoperability between different HPGC systems.

  • Geospatial Optimization Models and Solvers: Standardizing the formulation and solution approaches for geospatial optimization problems can promote the adoption and exchange of optimization models and solvers. Defining standard problem representations, optimization algorithm interfaces, and solution result formats would enable users to easily apply and integrate HPGC-based geospatial optimization techniques into their workflows.

3.  HPGC API

NOTE:    This section describes the implementation of the high performance geospatial computing API by GeoLabs.

3.1.  HPGC API

This section covers the HPGC API. The initial API was developed based on the OGC API — Processes Standard[14].

3.1.1.  API — Processes

In Testbed 19, OGC API — Processes was adopted as the base API for algorithm management, job scheduling, job monitoring, and workflow orchestration for HPGC.

3.1.2.  HPGC Profile of API — Processes

Common geospatial processing frameworks, geospatial algorithms, clustering frameworks, and typical workflow orchestrations can be implemented as processes or resource-based composite processes managed through an API — Processes implementation instance. The processes were designed to support functions similar to those of CyberGIS-Compute, but generalized to be usable with other high performance computing frameworks.
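
For illustration, the following Python sketch submits an asynchronous execute request to such a process through the standard OGC API — Processes execution endpoint and then polls the resulting job resource until completion. The server URL, process identifier, and input names are hypothetical and do not correspond to a specific Testbed 19 deployment.

# Minimal sketch of executing an HPGC process via OGC API - Processes (illustration only).
# The server URL, process id, and input names below are hypothetical.
import time
import requests

SERVER = "https://hpgc.example.org/ogc-api"
PROCESS_ID = "ndvi-aggregation"

execute_request = {
    "inputs": {
        "collection": "sentinel-2-l2a",
        "bbox": [-95.5, 39.0, -94.5, 40.0],
        "datetime": "2023-07-01/2023-07-31",
    },
    "response": "document",
}

# Ask for asynchronous execution so the HPC job can run in the background.
resp = requests.post(
    f"{SERVER}/processes/{PROCESS_ID}/execution",
    json=execute_request,
    headers={"Prefer": "respond-async"},
    timeout=30,
)
resp.raise_for_status()
job_url = resp.headers["Location"]  # URL of the created job resource

# Poll the job until it finishes, then fetch its results.
while True:
    status = requests.get(job_url, timeout=30).json()
    if status["status"] in ("successful", "failed", "dismissed"):
        break
    time.sleep(10)

if status["status"] == "successful":
    results = requests.get(f"{job_url}/results", timeout=30).json()
    print(results)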

3.2.  HPGC API Client

The draft HPGC API is implemented as a profile of the OGC API — Processes Standard, which is based on OpenAPI technology. The Processes API Core does not mandate any encoding or format for the formal definition of the API. The OpenAPI 3.0 specification is one option for defining the Processes API, and a conformance class is therefore specified for OpenAPI 3.0 that depends on the Core requirements class. While the use of OpenAPI 3.0 for the formal definition of the Processes API is not mandatory, the requests and responses of the Processes API are specified using OpenAPI 3.0 schemas. With OpenAPI technology, it is possible to generate clients in different languages using OpenAPI Generator, including C++, Java, Go, Python, and JavaScript. For Testbed 19, the Python client SDK was implemented with the assistance of automatic client generation against the OpenAPI definition of the draft HPGC API.
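
A minimal sketch of this client-generation workflow is shown below. The OpenAPI document URL, the generated package name hpgc_client, and the generated module and class names are hypothetical, since the actual names depend on the OpenAPI definition and the generator configuration used.

# Minimal sketch of generating and using a Python client SDK from the HPGC
# OpenAPI definition (illustration only). The OpenAPI document URL and the
# generated package, module, and class names are hypothetical.
#
# Step 1: generate client stubs and models with OpenAPI Generator, for example:
#   openapi-generator-cli generate \
#       -i https://hpgc.example.org/ogc-api/openapi.json \
#       -g python -o ./hpgc-client --package-name hpgc_client
#
# Step 2: use the generated client to discover the processes offered by the server.
import hpgc_client
from hpgc_client.api.processes_api import ProcessesApi

configuration = hpgc_client.Configuration(host="https://hpgc.example.org/ogc-api")

with hpgc_client.ApiClient(configuration) as api_client:
    processes_api = ProcessesApi(api_client)
    process_list = processes_api.get_processes()  # GET /processes
    for summary in process_list.processes:
        print(summary.id, "-", summary.title)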

3.3.  ZOO-Project with HPC support implementation

GeoLabs built on its experience in accessing and using HPC environments and on the support already available in the ZOO-Project. The ZOO-Project is an Open Source Reference Implementation of the OGC API — Processes — Part 1: Core Standard and the OGC Web Processing Service (WPS) 2.0.0 Standard, released under an MIT/X11 license.

In the past few years, GeoLabs developed support for the OGC API — Processes — Part 2: Deploy, Replace, Undeploy draft specification as part of the ZOO-Project. This provides the capability to deploy a CWL-packaged application in a Kubernetes cluster. The implementation follows the OGC Best Practice for Earth Observation Application Package. Moving the processing to the data, rather than the data to the processing engine, is now possible.

By combining support for remote HPGC job scheduling with support for the OGC API — Processes — Part 2: Deploy, Replace, Undeploy draft specification, an end-to-end solution for easier interaction with HPGC is proposed. This solution partly covers the CyberGIS-Compute capabilities.

The source code for the HPGC support was published in the official ZOO-Project GitHub repository. The binary Docker image is published on Docker Hub and the corresponding Helm chart on Artifact Hub, so the solution can be deployed on a Kubernetes cluster.

3.3.1.  Security consideration

Because deploying and executing tasks on an HPC platform consumes the limited resources allocated to the project, the system requires authentication to ensure that only authorized users can create and run new process resources. The security mechanism available in the ZOO-Project as a filter_in process is used. The concepts of filter_in and filter_out were introduced during the implementation of the OGC API — Processes — Part 2: Deploy, Replace, Undeploy draft specification. These security filters, like any other ZOO-Kernel process, run for every received request: filter_in processes run before the request is processed, while filter_out processes run after.

Figure 1 illustrates the implementation of authentication as a web process. By providing such processes, a developer can change the default ZOO-Kernel behavior for any request. For example, a service can be implemented that verifies whether an authenticated user is authorized to access a given OGC API endpoint.

Figure 1 — Illustration of authentication implementation based on the OpenID Connect (OIDC)

Keycloak, which offers a state-of-the-art solution for setting up an OpenID Connect provider, was used for authentication. With it, a dedicated “OGC_TESTBED19_SECURED_AREA” realm was created for which both GitHub and GitLab (using gitlab.ogc.org) were configured as identity providers.

A dedicated authorized_users parameter was added to the security section to store the comma-separated list of users allowed to access the secured OGC API endpoints.

In addition to using Keycloak, another prototype server instance was successfully implemented that authenticates users with the Authenix server, a closed-source solution providing a compliant OpenID Connect Authorization Server with federated SAML2-based login from Google, Facebook, and eduGAIN. By doing so, users can authenticate using their OGC portal credentials.
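
For illustration, the following Python sketch obtains an access token from a Keycloak realm and presents it as a Bearer token to a secured OGC API endpoint. The host names, client identifier, and credentials are hypothetical placeholders rather than the actual Testbed 19 configuration, and the password grant shown is only one of several OpenID Connect flows that could be used.

# Minimal sketch of calling a secured OGC API endpoint with an OIDC token
# (illustration only). Host names, client id, and credentials are hypothetical.
import requests

KEYCLOAK = "https://auth.example.org"
REALM = "OGC_TESTBED19_SECURED_AREA"
TOKEN_URL = f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token"

# Resource Owner Password Credentials flow, assuming a confidential client.
token_resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "password",
        "client_id": "hpgc-client",
        "client_secret": "change-me",
        "username": "alice",
        "password": "secret",
    },
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Call a secured OGC API - Processes endpoint with the Bearer token.
resp = requests.get(
    "https://hpgc.example.org/ogc-api/processes",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
print(resp.status_code, resp.json().get("processes", []))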

3.3.2.  DeployProcess

In the OGC API — Processes — Part 2: Deploy, Replace, Undeploy draft specification, the Deploy operation allows an authorized user to deploy a new process on the processing server. The draft specification describes that the server may implement support for deploying an OGC Application Package (using the encoding application/ogcapppkg+json). An OGC Application Package is a process package described using the OGC Application Package information model. The application package comprises a JSON object with two parameters: a processDescription and an executionUnit. The processDescription conforms to the processes.yaml schema. The process description part corresponds to what is obtained as a response for a GET request to the /processes/{processId} path. On the other hand, the executionUnit conforms to the executionUnit.yaml schema. The execution unit part defines what should be deployed on the processing server.
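
A minimal sketch of such a Deploy request is shown below. The server URL, process metadata, and container image reference are hypothetical, and the executionUnit content is only indicative of the general shape described by the draft specification; the exact structure is defined by the executionUnit.yaml schema.

# Minimal sketch of deploying an OGC Application Package (illustration only).
# The server URL, process metadata, and container image are hypothetical.
import requests

SERVER = "https://hpgc.example.org/ogc-api"

application_package = {
    "processDescription": {
        "id": "ndvi-aggregation",
        "title": "NDVI aggregation on HPC",
        "version": "0.0.1",
        "inputs": {
            "collection": {"title": "Input collection", "schema": {"type": "string"}},
        },
        "outputs": {
            "statistics": {
                "title": "Aggregated NDVI statistics",
                "schema": {"type": "string", "contentMediaType": "application/json"},
            },
        },
    },
    "executionUnit": {
        "type": "docker",
        "image": "registry.example.org/hpgc/ndvi-aggregation:0.0.1",
    },
}

resp = requests.post(
    f"{SERVER}/processes",
    json=application_package,
    headers={
        "Content-Type": "application/ogcapppkg+json",
        "Authorization": "Bearer <access token>",  # deployment requires authorization
    },
    timeout=60,
)
print(resp.status_code, resp.headers.get("Location"))  # location of the new process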

In the case of HPC, only Secure Shell Protocol (SSH) access to the HPC instance is made available. Consequently, deploying a new process on the processing server requires storing the associated metadata information and the Singularity container[5] to run on the HPC in case the container for the process is unavailable. A container consists of an entire runtime environment: an application plus all its dependencies, libraries, and configuration files bundled into one package[5]. Singularity is a tool for running software containers on HPC systems, similar to Docker[5]. Singularity containers are common and well supported by HPC systems[5].

To store the metadata information associated with the new process, a dedicated filter_in process named securityIn, which the ZOO-Kernel runs on every request, was implemented. For deployment in an HPC environment, a dedicated service called DeployOnHpc is invoked through SSH to create a Singularity container using the image provided in the executionUnit.

When the filter_in process (securityIn) detects a Deploy operation using the application/ogcapppkg+json encoding, it parses the metadata information from the processDescription and invokes the execution of the DeployOnHpc process asynchronously by sending an Advanced Message Queuing Protocol (AMQP) message to the ZOO-FPM.

The sequence diagram below illustrates the method of deploying a new process using the prototype server implementation.

Figure 2 — Sequence diagram ZOO-Project Deploy Operation

Figure 3 illustrates an example of using the API to deploy a process.

Figure 3 — Illustration of API implementation for deploying a process

The Replace operation works the same way. If version management is unavailable for the deployed process, the Replace operation can be implemented as another filter_in process. Version management would require clearly defining the expected numbering scheme and how to handle different versions of the same process.