Mobility Data Science Discussion Paper

Open Geospatial Consortium

Submission Date: 2023-09-28

Approval Date: 2023-09-28

Publication Date: 2024-01-29

External identifier of this OGC® document: http://www.opengis.net/doc/dp/mobility-data-science

Internal reference number of this OGC® document: 23-056

Category: OGC® Discussion Paper

Editors: Song WU and Mahmoud SAKR

Copyright notice

To obtain additional rights of use, visit http://www.opengeospatial.org/legal/

License Agreement

Permission for use/distribution of this document and any associated materials is subject to the terms of this License Agreement: https://www.ogc.org/license

Warning

This document is not an OGC Standard. This document is an OGC Discussion Paper and is therefore not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, an OGC Discussion Paper should not be referenced as required or mandatory technology in procurements.

Document type: OGC® Discussion Paper

Document subtype:

Document stage: Approved

Document language: English

Table of Contents

1. Mobility Data Science Summit
- 1.1. Summit Organizers
- 1.2. Contributors
2. Overview
3. What is special about ‘Mobility’ when it comes to Data Science?
4. What is the state of technology and tools?
5. What is the state of standards?
6. What are the open problems and challenges?
7. References

1. Mobility Data Science Summit

1.1. Summit Organizers

Table 1. List of summit organizers
Name	Affiliation
Mahmoud Sakr	Université Libre de Bruxelles
Nobuhiro Ishimaru	Hitachi
Kyoung-Sook Kim	National Institute of Advanced Industrial Science and Technology (AIST)
Scott Simmons	OGC

1.2. Contributors

Table 2. List of document contributors
Name	Affiliation
Martin Desruisseaux	Geomatys
Cheng Fu	University of Zurich
Anita Graser	Austrian Institute of Technology
Charles Heazel	WiSC Enterprises
Pin Kung	Sky eyes GPS technologies
Johannes Lauer	HERE
Steve Liang	University of Calgary
Chris Little	UK Met Office
Mohamed Mokbel	University of Minnesota
George Percival	GeoRoundtable
Alex Ramage	Scottish Government / Transport Scotland
Rob Smith	Away Team Software
Stan Tillman	Hexagon AB
Esteban Zimanyi	Université Libre de Bruxelles

2. Overview

Almost every activity in modern life leaves a digital trace, typically including location and time. Whether captured by a sensor, entered manually, or extracted from a social media post, spatiotemporal data is growing in volume, variety, and velocity at an unprecedented rate. The ability to manage and analyze this data is important for many application domains, including smart cities, health, transportation, agriculture, sports, and biodiversity. It is critical not only to manage and analyze the data effectively but also to uphold privacy and ethical considerations. Since GPS was opened to civilian use in the 1980s, followed by technological advances in other location tracking systems such as Wi-Fi, RFID, and Bluetooth, it has become increasingly easy to track moving objects. The Mobility Data Science Summit was an opportunity to discuss the challenges of managing this data and making sense of it, with a focus on the tooling and standardization requirements.

Figure 1. Mobility Data Science Overview

Data science is commonly understood as the pipeline of methods and tools that runs from data acquisition to the delivery of useful insights, passing through data cleaning, integration, management, and analysis. Many tools exist to help data scientists at every step of this pipeline. Yet mobility data has its own characteristics that common data science tools cannot handle. Mobility data is typically available as sequences of time-stamped location points generated by location tracking devices. The data is therefore both multidimensional and a time series, a structure that requires special data science tools and methods.
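
The structure described above, a sequence of time-stamped location points, can be illustrated with a minimal Python sketch. It is an illustration only: the `Fix` class, the `position_at` function, the planar coordinates in metres, and the choice of linear interpolation between samples are all assumptions made for this example and are not taken from any OGC Standard.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fix:
    """One observation: time in seconds, position in planar metres (assumed units)."""
    t: float
    x: float
    y: float


def position_at(track: list, t: float) -> Optional[tuple]:
    """Linearly interpolate the position at time t; return None outside the track.

    Assumes the fixes are sorted by strictly increasing time.
    """
    if not track or t < track[0].t or t > track[-1].t:
        return None
    times = [f.t for f in track]
    i = bisect_right(times, t)
    if i == len(track):  # t equals the last timestamp
        return (track[-1].x, track[-1].y)
    a, b = track[i - 1], track[i]
    w = (t - a.t) / (b.t - a.t)
    return (a.x + w * (b.x - a.x), a.y + w * (b.y - a.y))


track = [Fix(0, 0.0, 0.0), Fix(10, 100.0, 0.0), Fix(20, 100.0, 50.0)]
print(position_at(track, 5))   # halfway along the first segment
print(position_at(track, 15))  # halfway along the second segment
```

The sketch treats the track as a continuous function of time rather than a bag of points, which is what distinguishes it from a plain table of coordinates or a plain time series.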

OGC has proactively envisioned the need for specialized data models and exchange formats, and formed working groups including the Moving Features SWG and the Temporal DWG. Temporal concepts have naturally found their way into the work of other working groups as well, such as GeoPose. This summit aimed to synchronize across working groups and to align the concepts.

3. What is special about ‘Mobility’ when it comes to Data Science?

It is not surprising that mobility data finds important applications in a broad range of domains such as maritime, public transport, and logistics. In some sense, it can have a fundamental impact on many aspects of real life.

Different from other types of data, such as spatial data and time series data, mobility data has several challenging characteristics.

Dynamic: mobility data records the changing properties of moving objects over time. Once it has been collected and stored, mobility data is very difficult or impossible to update or correct, so a good practice is to only append new information to the original. However, most existing data infrastructures and formats were designed for static attributes, which makes them ill-suited to dynamic data. One example is dynamic metadata in a map, which may require clever use of attributes in OpenStreetMap.
Diverse: because mobility data can be collected using various devices (e.g., GPS, Bluetooth, RFID) and sampling strategies, the availability of mobility data and the scale and frequency of collection vary considerably across datasets. Therefore, analysis methods and tools may not be transferable across types of data. Precision also varies, so users need to consider scale and precision with respect to the science being explored, e.g., movement of people, wildlife tracking, or agriculture.
Heterogeneous: besides the common form of mobility data as a sequence of time-stamped points, other forms exist, including but not limited to discrete check-in data (e.g., geo-tagged tweets, ticketing/accounting data, and taxi pick-up/drop-off data), origin-destination (OD) flow data, and schedules and real-time operations of public transport in the form of GTFS and GTFS-realtime. For example, some public transport companies in Brussels are interested in developing a common ticketing scheme that can support multimodal transportation. However, such data usually has no accurate coordinates attached, which makes analysis and modeling more complex. Interpolation on such data may not make sense at all, and the lack of continuous tracking makes it hard or even impossible to do some aggregation analysis, such as computing the centroid of movement or the average of a scalar property like speed. A larger discussion to define which data sources and which methods should be included under the umbrella of mobility data science would be necessary, but it is beyond the scope of this paper.
Viewpoint: one unique aspect of mobility data is whether the data has an “Eulerian” or a “Lagrangian” viewpoint. This differentiates, for instance, between the moving train carriage observed from the station platform and the station platform observed from the carriage. Autonomous vehicle sensor systems may take either approach.

Figure 2. Mobility Data Computation Stack

As shown in Figure 2, mobility data science brings requirements to several layers, from the low level of computation infrastructure to the high level of data modeling and various tasks.

Computation Infrastructure determines where and how mobility data can be generated, collated, derived, swapped, and archived.

Cloud Services: Cloud services have been widely deployed in recent years thanks to advantages such as easy scalability and high availability. Many cloud services are on the market beyond the large providers like Azure and AWS, and many cloud platforms offer unique value for specialized uses. However, most existing cloud services target general data. This raises the question of whether a mobility data cloud can be built today from the basic modules these services provide. For mobility data science, the power actually lies in coupling and connecting cloud services (linked and available data), and cloud connectivity is required because probably no single cloud fits everything. Connecting different cloud services remains a challenge, because mobility data from multiple clouds may have different semantics and was not necessarily planned to work together, so a big data lake alone is not enough. Another issue is that cloud providers may offer everything that is needed, but not necessarily organized or assembled in a fashion that is immediately useful. Also, it is easier to collect data than to use and analyze it, so much information can be found in the cloud, but the question remains how much of it is made accessible and usable.
Edge Computing/IoT: Mobility data may pass through several places between where it is generated and where it is used, e.g., from vehicle to vehicle edge to road edge to road network to cloud. These places can differ greatly in computational power, so it is worthwhile to investigate which kinds of tasks are best assigned to which places, and which encoding formats are most suitable for resource-constrained devices.
5G offers unprecedented ultra-low latency: This latency enables the capture of features that require a very high sampling frequency, such as orientation, which is relatively new compared to typical features like longitude and latitude. Another implication of 5G is that tasks no longer have to be fixed at either the edge or the cloud; data and computations can move seamlessly between the two. Coupled with the characteristics of mobility data, these computation infrastructures call for innovative data modeling techniques and encoding formats so that mobility data can flow smoothly through various computational environments.

Next, mobility data tasks face more challenges than other types of data.

Fusion of Sensors (data aggregation/integration): Nowadays, moving objects are equipped with an array of sensors, and trajectories often need to be built from multiple sensor inputs. For example, cars are now full of sensors, creating a “data ocean”. Such fusion of multiple inputs allows analysis of heterogeneous data of independent moving features, thus enabling interesting applications such as autonomous robot navigation. However, the following issues still need to be solved:
1. varying levels of access to data causes problems with aggregation;
2. different sources were not necessarily planned to work together;
3. friction on aggregation is very high due to different data models and semantics;
4. although there has been considerable work on trajectories, aggregation of trajectory data may lead to better-fitting use cases;
5. for time-critical situations where every second matters, such as fire rescue and disaster risk management, how to efficiently aggregate multiple sensor inputs and stitch all kinds of dynamic data in real-time to support timely decision making is still a challenging problem; and
6. how to align sources with different spatial and temporal resolution.
Data Sharing: Some communities have strong drivers for data sharing: in ocean science and arctic studies, the data is so difficult to collect that researchers have to share information. Some communities, such as cyclists, share voluntarily and very well. Although most people agree that mobility data sharing can lead to improvements in society, there is still reluctance to share. Barriers include privacy, loss of competitive advantage, lack of cloud solutions, regulatory compliance, fear of losing control, and interoperability issues. Moreover, business models can be one-way: data is shared into a system without corresponding sharing outward.
Visualization: Visualization is a good way for people to explore mobility data. However, with massive amounts of mobility data, simply plotting everything makes a big mess. A common practice is therefore to aggregate the data and use GIS software to visualize the results, which requires visual summaries that make sense for moving data. For example, mobility data can be represented as density maps and as grid-based descriptions of values and trends through "prototypes," e.g., showing the density of objects moving north in a grid. In the geography community, Discrete Global Grid Systems (DGGS) have been shown to be valuable, and grids, particularly equal-area grids, are very useful for grid-based analysis on scalable computing clusters.
Mobility data science as a service: Mobility data science can be provided as a service that offers rich functionality to help users better understand and utilize mobility data. Typical statistics are not enough for this purpose, and users will be more interested in asking mobility questions, such as “Are there any times when two cars come within 100 meters of each other?” How to express such requests in an API needs to be investigated, and the emerging OGC APIs and SQL may serve as a basis for such service interfaces.
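
To make the example question about two cars concrete, the following Python sketch computes the closest approach of two tracks. It is not an OGC API or an SQL interface; the function names `interp` and `closest_approach`, the `(t, x, y)` tuple layout, planar coordinates in metres, and linear movement between samples are assumptions made for illustration. Because both tracks are linear between consecutive timestamps of their merged timeline, the minimum distance within each interval can be found exactly rather than only at the sampled instants.

```python
from bisect import bisect_right
from math import hypot


def interp(track, t):
    """Linearly interpolate (x, y) at time t; track is [(t, x, y), ...] with
    at least two fixes and strictly increasing timestamps covering t."""
    times = [p[0] for p in track]
    i = min(bisect_right(times, t), len(track) - 1)
    (t0, x0, y0), (t1, x1, y1) = track[i - 1], track[i]
    w = (t - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)


def closest_approach(a, b):
    """Return (min_distance, time) over the common time span of tracks a and b,
    or None if the tracks do not overlap in time for a positive duration."""
    start, end = max(a[0][0], b[0][0]), min(a[-1][0], b[-1][0])
    if start >= end:
        return None
    ts = sorted({t for t, _, _ in a + b if start <= t <= end} | {start, end})
    best = None
    for t0, t1 in zip(ts, ts[1:]):
        ax0, ay0 = interp(a, t0)
        bx0, by0 = interp(b, t0)
        ax1, ay1 = interp(a, t1)
        bx1, by1 = interp(b, t1)
        dx0, dy0 = ax0 - bx0, ay0 - by0          # relative position at t0
        vx, vy = (ax1 - bx1) - dx0, (ay1 - by1) - dy0  # change over the interval
        vv = vx * vx + vy * vy
        # Fraction of the interval at which the relative distance is smallest.
        s = 0.0 if vv == 0 else max(0.0, min(1.0, -(dx0 * vx + dy0 * vy) / vv))
        d = hypot(dx0 + s * vx, dy0 + s * vy)
        if best is None or d < best[0]:
            best = (d, t0 + s * (t1 - t0))
    return best


car_a = [(0, 0.0, 0.0), (100, 1000.0, 0.0)]
car_b = [(0, 500.0, 400.0), (100, 500.0, -600.0)]
distance, when = closest_approach(car_a, car_b)
print(f"closest approach: {distance:.1f} m at t = {when:.1f} s")
print("within 100 m:", distance <= 100.0)
```

In this example the cars are hundreds of metres apart at both sampled instants, yet they pass within about 71 m of each other in between, which is the kind of answer that typical statistics on the raw samples would miss.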

In terms of general mobility data challenges, one special concern is privacy issues. Privacy can block analysis or can enable better analysis by using more restrictive data. This issue concerns not only humans, but also some commercial and endangered animals, which may also have security concerns. For example, cows can be tracked to provide business-sensitive analyses. For humans, due to the highly predictable nature of human behaviors, even small pieces of mobility data can lead to the leakage of identity information. So it is important to find the balance between utility of data and privacy preservation. Unfortunately, most privacy preservation methods restrict analysis ability and de-anonymization is always a concern. Notably, in some cases, the privacy issue may not exist when the question of interest can be answered at a mass level and data analysis does not need to focus on individuals.
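
The point about answering questions at a mass level can be sketched as follows. The Python example counts distinct individuals per grid cell and releases only cells with at least `k` individuals. The function name `aggregate_counts`, the cell size, and the threshold `k` are assumptions made for illustration; small-count suppression is only one simple technique and does not by itself rule out de-anonymization.

```python
from collections import defaultdict


def aggregate_counts(points, cell_size, k):
    """Count distinct individuals per grid cell and suppress sparse cells.

    points: iterable of (person_id, x, y) in planar coordinates (assumed).
    Returns {(col, row): n_people} only for cells with at least k people.
    """
    people = defaultdict(set)
    for person_id, x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        people[cell].add(person_id)
    return {cell: len(ids) for cell, ids in people.items() if len(ids) >= k}


points = [
    ("a", 10, 10), ("b", 20, 30), ("c", 90, 95), ("a", 15, 12),  # three people in cell (0, 0)
    ("d", 150, 20),                                               # a single person in cell (1, 0)
]
print(aggregate_counts(points, cell_size=100, k=3))
```

The output contains no identifiers and no cell with fewer than `k` people, so the cell visited only by person "d" is withheld.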

Last but not least, the evaluation of mobility data needs more effort in the future. Due to the particularities of mobility data, better characterization of data quality and more means of assessing it are needed, so that people can know whether the datasets at hand are suitable for the target analysis. Interoperability is another important aspect: for example, integrating mobility data from different systems requires that those systems can talk to each other and understand each other’s data semantics.

The reader is referred to additional community publications that elucidate the differentiation between mobility data science and general data science, and could thus complement the discussion in this section [1-3].

4. What is the state of technology and tools?

Currently, there are not enough common tools for mobility data science, because both mobility datasets and use cases are so diverse. Existing analysis methods and tools are often not transferable across types of data. This lack of widely used tools is slowing down the community effort towards collaboratively building a mobility data science ecosystem and toolbox. For massive datasets, the existing big data tools are designed for general purposes and are limited in their ability to handle mobility data specifically. As a result, mobility data is not a first-class citizen in these tools.

A natural question then is: which kinds of tools are expected for mobility data science? A first requirement is the capability to rapidly process large mobility datasets, and the critical point is to enable proper analysis through data reduction. For example, simply plotting everything makes a big mess, so it is important to design visual summaries that make sense of the data. Recent work reflecting this point is presented in [4], which models movement locations, directions, and speeds using “prototypes” and supports exploration and anomaly detection.
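
As a rough illustration of data reduction for visual summaries, the Python sketch below reduces a set of tracks to one record per grid cell: the number of movement segments starting in the cell and their mean heading. This is a plain grid summary and not the model presented in [4]; the function name `grid_summary`, the planar coordinates, and the cell size are assumptions made for this example.

```python
from collections import defaultdict
from math import atan2, degrees, hypot


def grid_summary(tracks, cell_size):
    """Summarise movement per grid cell as (segment_count, mean_heading).

    tracks: list of tracks, each a list of (x, y) in planar metres (assumed).
    Heading is in degrees clockwise from north (+y), in [0, 360). If the
    directions in a cell cancel out exactly, the reported heading is 0.
    """
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of unit dx, sum of unit dy
    for track in tracks:
        for (x0, y0), (x1, y1) in zip(track, track[1:]):
            length = hypot(x1 - x0, y1 - y0)
            if length == 0:
                continue  # a stationary segment has no direction
            cell = (int(x0 // cell_size), int(y0 // cell_size))
            entry = acc[cell]
            entry[0] += 1
            entry[1] += (x1 - x0) / length
            entry[2] += (y1 - y0) / length
    return {
        cell: (n, degrees(atan2(sx, sy)) % 360.0)
        for cell, (n, sx, sy) in acc.items()
    }


tracks = [
    [(10, 10), (10, 60), (10, 120)],  # moving north
    [(20, 5), (20, 80)],              # moving north
    [(150, 50), (190, 50)],           # moving east
]
for cell, (count, heading) in sorted(grid_summary(tracks, cell_size=100).items()):
    print(cell, count, round(heading, 1))
```

Plotting one arrow per cell from such a summary scales with the number of cells rather than the number of points, which is the motivation for summaries of this kind.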

Another example is Mapbox vector tiles, which can carry time information so that the returned tiles are time-aware, unlike the spatial-only tiles served by traditional tile servers. To better meet the requirements of mobility data, the generation strategy of vector tiles can also take into account factors such as zoom level, viewport, and the amount of data being processed. A second requirement follows from the question “what can be done with the existing OGC Standards to enable richer queries and analysis?” For example, it is not enough to answer questions based on a single trajectory; use cases that go beyond a single trajectory to a group of trajectories also need to be considered. Later, once some widely used tools appear, an attempt can be made to structure and classify mobility datasets to derive metadata that can help define use cases and give guidelines for certain types of analysis (Figure 3).

Figure 3. Evaluation of dataset suitability

During the summit, invited speakers brought examples of their work about creating tools for mobility data science, including QARTA [6], MobilityDB [5], and SensorUp [10].

QARTA is an open source map service featuring high accuracy and scalability. The main motivation behind QARTA is that researchers and industry practitioners have put so much effort into the efficiency of map services that efficiency is no longer a bottleneck. Instead, accuracy is becoming the bigger concern. For example, even with the most efficient shortest path algorithm at hand, the query results can only be as accurate as the input map. With the idea that mobility data can be leveraged to boost the accuracy of map services, QARTA includes a Match or Make module (see Figure 4). Given a road network G and trajectory points P, this module performs map matching when G is more accurate than P. Conversely, when P is more accurate than G, it performs map making to update G based on P. In summary, QARTA’s success is due to two features: (1) QARTA uses machine learning to build its own highly accurate map, in terms of map topology and, more importantly, in terms of dynamic metadata like edge weights of the road network; and (2) QARTA employs machine learning to calibrate its query answers based on various contextual information. Currently, QARTA has been deployed in all taxis and in the third largest food delivery company in the state of Qatar, and has performed as well as, or even better than, commercial map services.

Figure 4. Match or Make (taken from [6])

MobilityDB is an open source geospatial trajectory management and analysis platform, which is built on top of PostgreSQL and PostGIS. With the aim to be a mainstream system for industry use, MobilityDB provides many benefits including:

compact geospatial data storage;
rich mobility analytics;
easy-to-use full SQL interface; and
compliance with OGC Moving Features Standards, et cetera.

To support efficient management of mobility data, MobilityDB implements multiple temporal types, such as tgeogpoint for a temporal geography point and tfloat for dynamic attributes including speed, heading, and so on. Currently, MobilityDB is in active development, and more functionalities will be provided or enhanced.
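
The notion of a dynamic attribute such as speed can be illustrated without a database. The Python sketch below derives a piecewise-constant speed from a sequence of time-stamped points, which is conceptually similar to deriving a temporal float from a temporal point. It is plain Python and does not use or reproduce the MobilityDB API; the function name `speed_segments`, the `(t, x, y)` tuple layout, and the planar units are assumptions made for illustration.

```python
from math import hypot


def speed_segments(track):
    """Derive a piecewise-constant speed from a list of (t, x, y) fixes.

    Assumes t in seconds and x, y in planar metres. Returns a list of
    (t_start, t_end, speed_m_per_s), one entry per consecutive pair of fixes.
    """
    segments = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        segments.append((t0, t1, hypot(x1 - x0, y1 - y0) / (t1 - t0)))
    return segments


track = [(0, 0.0, 0.0), (10, 100.0, 0.0), (20, 100.0, 50.0)]
for t0, t1, v in speed_segments(track):
    print(f"[{t0}, {t1}) s: {v:.1f} m/s")
```

The result is itself a function of time, so it can be queried, restricted, or aggregated in the same way as the positions it was derived from.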

Figure 5. MobilityDB Architecture

The SensorUp software [10] serves as the data fabric between an organization’s real-time data, its prediction capabilities, and its operational execution. By stitching together dynamic data sources from various places, the software makes it possible to build interoperable digital infrastructure for real-time, data-driven operations and decision making. Many use cases can benefit from this kind of infrastructure, such as digital assistance for firefighters before flashover, and dispatching inspection and repair before equipment failure.

Figure 6. SensorUp Software

Cross-scale aggregation, visualization and analytics are useful when handling big trajectory data. Inspired by this, a quad-tree based trajectory simplification approach is presented in [7], where the spatial distribution of POIs determines the degree of trajectory simplification. So in areas with a higher density of POIs, a trajectory will be less simplified.
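
The idea that POI density controls the degree of simplification can be sketched in Python as follows. This is not the algorithm of [7]; it is a simplified illustration in which a quadtree is subdivided wherever a cell holds more than `capacity` POIs, and a trajectory keeps only the first point of each run of consecutive points that fall in the same leaf cell. The function names, `capacity`, and `max_depth` are assumptions made for this example.

```python
def build_leaves(pois, bounds, capacity, max_depth):
    """Split bounds (xmin, ymin, xmax, ymax) into quadrants while a cell holds
    more than `capacity` POIs and depth remains; return the leaf cells."""
    xmin, ymin, xmax, ymax = bounds
    inside = [(x, y) for x, y in pois if xmin <= x < xmax and ymin <= y < ymax]
    if len(inside) <= capacity or max_depth == 0:
        return [bounds]
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    leaves = []
    for quad in ((xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                 (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)):
        leaves += build_leaves(inside, quad, capacity, max_depth - 1)
    return leaves


def leaf_of(leaves, x, y):
    """Return the index of the leaf cell containing (x, y), or None."""
    for i, (xmin, ymin, xmax, ymax) in enumerate(leaves):
        if xmin <= x < xmax and ymin <= y < ymax:
            return i
    return None


def simplify(track, leaves):
    """Keep the first point of each run of consecutive points sharing a leaf
    cell, plus the final point of the track."""
    kept, previous = [], object()  # sentinel that equals no leaf index
    for point in track:
        leaf = leaf_of(leaves, *point)
        if leaf != previous:
            kept.append(point)
            previous = leaf
    if track and kept[-1] != track[-1]:
        kept.append(track[-1])
    return kept


# POIs are concentrated in the lower-left corner, so cells are smaller there.
pois = [(10, 10), (20, 10), (10, 20), (40, 40)]
leaves = build_leaves(pois, (0, 0, 128, 128), capacity=1, max_depth=3)
track = [(x, 8) for x in range(4, 128, 8)]  # 16 evenly spaced points heading east
print(len(leaves), "leaf cells")
print(simplify(track, leaves))
```

In this example the retained points are closer together near the POI cluster and farther apart in the empty eastern half, so detail is preserved where the geographic context is rich.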

Figure 7. Multi-level Trajectory Simplification

5. What is the state of standards?

Because OGC standards focus mainly on static geospatial data, the support of mobility data may be disruptive to the existing infrastructure. In the future, more standards will be needed that take into account the dynamic nature of mobility data. A good example of such standards is the OGC SensorThings API [8], which provides an open, geospatial-enabled, and unified way to interconnect the Internet of Things (IoT) devices, data, and applications over the Web. Good standards like this can bring many benefits, such as promoting best practices in industry, saving time for collaboration, reducing communication cost, enabling interoperability among multiple systems, et cetera.

Although many categories of relevant standards exist, as shown in Figure 8, such as public transport, ticketing, traffic, and spatial data, the field of ‘mobility data standards’ is still in its infancy. Much of what is needed is understood, but the work is just getting started. Figure 9 outlines the ongoing standardization work for mobility data, which currently focuses more on expressiveness than on data size issues (e.g., those caused by capturing highly dynamic features like orientation).

Figure 8. Existing mobility-related standards

Figure 9. Ongoing Moving Feature standardization effort

Specifically, the following points are identified and agreed to be worthwhile for further standardization work.

Data quality/validation. Since the collection strategies for mobility data are so diverse, better characterization of data quality becomes necessary and more means to assess data quality are needed. Otherwise, data that is not reliable enough for the desired analysis will not lead to the right results. Specifically, the following issues are considered.
- What can we do with the data (license, quality, suitability)?
- A moving features standard should have an uncertainty field (e.g., for marking missing segments).
- Uncertainty/poor reliability of data integration may be an issue due to poor conventions, e.g., multiple ways to write an address.
Data cleaning. Data cleaning is a necessary step before any kind of analysis. For example, the existing ISO 19115 metadata Standard [9] does have a place to describe the source and processing of data, so it is possible to store information about which operations have been done on the dataset and the parameters used during processing.
Data pipelines. Standards are needed not only for representing, capturing, and storing mobility data, but also for processing it. One problem in data science is that the processing by the scientists impacts the results. Because there are too many variables in the processing pipeline, repeatability of results can be a problem. Is there a way to create guidelines that bring some discipline through the use of similar pipelines? For example, in text analysis there are standard measures and procedures to derive summaries and insight from text data: TF-IDF is widely used to compute the relevance score of a document for the query keywords. Similar standards for pipelines would enhance the consistency and repeatability of mobility data analysis.
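
The two preceding points, recording which operations and parameters were applied and making pipelines repeatable, can be sketched together in Python. The example runs a list of named steps and records a lineage entry for each. It is an illustration only: the function names, the two toy steps, and the field names of the lineage record are hypothetical and do not follow the ISO 19115 encoding.

```python
import json


def run_pipeline(data, steps):
    """Apply steps in order and record what was done.

    steps: list of (name, function, params); each function is called as
    function(data, **params). Returns (result, lineage), where lineage lists
    the step name, parameters, and record counts of every step.
    """
    lineage = []
    for name, func, params in steps:
        records_in = len(data)
        data = func(data, **params)
        lineage.append({
            "step": name,
            "params": params,
            "records_in": records_in,
            "records_out": len(data),
        })
    return data, lineage


def drop_duplicates(points):
    """Remove exact duplicate records while preserving order."""
    seen, kept = set(), []
    for p in points:
        if p not in seen:
            seen.add(p)
            kept.append(p)
    return kept


def thin(points, every):
    """Keep every n-th record."""
    return points[::every]


points = [(0, 0, 0), (0, 0, 0), (10, 5, 5), (20, 9, 9), (30, 14, 14)]
steps = [
    ("drop_duplicates", drop_duplicates, {}),
    ("thin", thin, {"every": 2}),
]
result, lineage = run_pipeline(points, steps)
print(result)
print(json.dumps(lineage, indent=2))
```

Publishing the lineage alongside the result lets another analyst rerun the same steps with the same parameters, or see at a glance why two analyses of the same dataset differ.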

Figure 10. Towards a discipline for mobility data science pipelines

Routing standards. Routing standards can be useful in working with end-point data. Currently there are two routing-related specifications. One of them includes a route exchange model, and the other specifies how to start, end, and return results from routing algorithms. Serving dynamic data needs more work, since OGC Standards have focused on static data until now.
Query and analysis. What can be done with the existing OGC Standards to enable query and analysis? For example, proximity analysis is not well considered in the standards. A conceptual model is needed before building a Web API.
Privacy. OGC is exploring Geoethics, and how to balance privacy preservation and data utility is becoming an increasing concern. Standards respecting privacy can also promote mutual trust between stakeholders to facilitate the sharing of mobility datasets with others.
Visualization. There are standards for general visualization that may not have explicit guidance on time. Business requirements for dashboards vary based on industry use. Dashboard frameworks for trajectories include geometry, semantics, and movement parameters (e.g., velocity and heading). One recommendation is to suggest the best visualization widgets for certain types of data.
Ground truth/domain expertise. Collecting ground truth is important for validating analysis results. Although some best practices can obtain semantic information directly from sensor inputs (e.g., SensorThings, which can record truck loading/unloading and dwelling), this is not always easy to do. In particular, human activities may not be directly sensed due to technical limitations or privacy issues. A common practice to infer semantics for raw trajectory data is to use the nearest POI, but the nearest POI is not always the place where activities occurred. Also, domain expertise is needed to interpret the nature of the data and the results of analysis.
FAIR principles. What can be done with standards to enhance alignment with FAIR principles? Much mobility data is collected, but how much is made accessible? Interoperability is also desired so that different systems can talk to each other and exchange information such as warnings, driving/resting times, transport details, et cetera.
Perspective/Context/Assumption. Mobility data is very broad in scope, so perspective needs to be considered. For example, geoscientists tend to see location as the critical point, and they may need to step out of their "geo shoes" to see other perspectives. Context is important: assumptions can impact analysis results, so it is often better to take the data as-is and not make assumptions. For example, Amazon has a patent for "anticipatory shipping," and some logistics companies may dispatch trucks before service is requested. Where does such context information fit in mobility data standards?
Multimodal-awareness. One future characteristic of mobility data is that it is multimodal. However, mobility data is usually collected in different cities and departments (road, bus, train, etc.), so data standards can play a critical role in composing multimodal mobility data for individuals using data from different departments.

6. What are the open problems and challenges?

The Open Geospatial Consortium (OGC) has defined a number of widely adopted standards for geospatial data. These standards primarily focus on static geospatial data, such as maps, satellite images, or geographical features. However, with the increasing availability of real-time or dynamic geospatial data, such as tracks of moving objects or live weather data, there is a need to extend these standards. ‘Mobility standards’ are still nascent. Much of what is needed is understood, but the work is just getting started.
Considerable effort is going into tracking where people go, but not all of the information is being used. Valuable analytics are needed, especially ones that avoid privacy issues by focusing not on individuals but on mass data analysis. Data density can also be too high for rapid processing, so data reduction is critical for proper analysis.
Continuing to collect extensive amounts of data and simply storing this data in a large repository, often referred to as a 'data lake', is insufficient. The true value of data lies not just in its volume but in the understanding, interpretation, and connection of the data in meaningful ways. The current challenge lies in bridging the gap between mere data collection and effective data utilization. To unlock the full potential of the data, a comprehensive approach is needed that includes semantics - understanding the meaning, context, and relationships within the data.
Check-ins related to mobility data refer to the information that is collected when an individual "checks in" at a particular location, often through a mobile app. This data can be highly valuable in a range of fields, including urban planning, transport, and social sciences. Despite this, there seems to be a lack of focus on managing and processing this data.
OGC, as part of its commitment to improving geospatial interoperability, has begun to explore Geoethics. This interdisciplinary field investigates the ethical, legal, and social implications of geoscience. One potential avenue to explore would involve studying the privacy of mobility data as a component of these endeavors.
Cloud providers indeed offer an extensive range of services and resources, which encompass almost anything one might need. However, the challenge lies in the fact that these resources aren’t always packaged or structured in a manner that’s straightforwardly useful. This brings us to the challenge of mobility data science as a service. Currently, a significant amount of data related to mobility is stored across various cloud providers. While this data is collected and available, at least for its owners, it’s not necessarily accessible or organized in a useful manner for data scientists, businesses, or other interested parties. Creating a mobility data science as a service would involve curating this data, organizing it, ensuring its quality, and perhaps even pre-processing it for certain common tasks.
Also, possibly as future work, the well-established, relatively rigorous approaches used in meteorological and oceanographic sciences for processing large amounts of heterogeneous, mobile, volatile, error-prone data could be considered. These processes recognize that errors may arise from transcription, telecommunications, instruments, calibration, or imprecise transfer functions, and may be random or correlated. Quality control and correction can use sophisticated mathematical techniques such as optimum interpolation and variational analysis, as well as statistically based approaches.

7. References

[1] Mokbel, M., Sakr, M., Xiong, L., Züfle, A., Almeida, J., Aref, W., … & Zimányi, E. (2023). Towards Mobility Data Science (Vision Paper). arXiv preprint arXiv:2307.05717.

[2] Sakr, M., Ray, C., & Renso, C. (2022). Big mobility data analytics: recent advances and open problems. GeoInformatica, 26(4), 541-549.

[3] Mokbel, M., Sakr, M., Xiong, L., Züfle, A., Almeida, J., Anderson, T., … & Zimányi, E. (2022). Mobility Data Science (Dagstuhl Seminar 22021). In Dagstuhl Reports (Vol. 12, No. 1). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

[4] Graser, A., Widhalm, P., & Dragaschnig, M. (2020). The M³ massive movement model: a distributed incrementally updatable solution for big movement data exploration. International Journal of Geographical Information Science. doi:10.1080/13658816.2020.1776293.

[5] Zimányi, E., Sakr, M., & Lesuisse, A. (2020). MobilityDB: A mobility database based on PostgreSQL and PostGIS. ACM Transactions on Database Systems (TODS), 45(4), 1-42.

[6] Musleh, M., Abbar, S., Stanojevic, R., & Mokbel, M. (2021). QARTA: an ML-based system for accurate map services. Proceedings of the VLDB Endowment, 14(11), 2273-2282.

[7] Fu, C., Huang, H., & Weibel, R. (2021). Adaptive simplification of GPS trajectories with geographic context – a quadtree-based approach. International Journal of Geographical Information Science, 35(4), 661-688.

[8] https://www.ogc.org/standards/sensorthings

[9] https://www.iso.org/standard/53798.html

[10] https://sensorup.com/