Open Geospatial Consortium
Submission Date: 2020-05-22
Approval Date: 2020-06-05
Publication Date: 2020-07-22
External identifier of this OGC® document: http://www.opengis.net/doc/WP/GeoDataSci
Internal reference number of this OGC® document: 20-001r2
Category: OGC® White Paper
Editor: George Percivall
Geospatial Data Science
Copyright notice
Copyright © 2020 Open Geospatial Consortium
To obtain additional rights of use, visit http://www.opengeospatial.org/legal/
Warning
This document is not an OGC Standard. This document is an OGC White Paper and is therefore not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, an OGC White Paper should not be referenced as required or mandatory technology in procurements.
Document type: OGC® White Paper
Document subtype:
Document stage: Approved
Document language: English
License Agreement
Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.
If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.
THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD.
THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.
This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.
Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications. This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.
- 1. Overview of White Paper
- 2. Overview of Geospatial Data Science
- 3. Data: Big Geospatial Data
- 4. Data: Data Scientists, Teams, Process
- 5. Data: Data Management
- 6. Tools: Geospatial Representations and Analytics
- 7. Tools: AI and Machine Learning for Geospatial
- 8. Tools: Models and Decisions
- 9. Data Science Applications and Ethics
- 10. Emerging Trends
- 11. OGC activities on Geospatial Data Science
- Annex A: Location Powers: Data Science Summit
- Annex B: Revision History
- Annex C: Bibliography
i. Abstract
This OGC White Paper describes Geospatial Data Science based on the Location Powers: Data Science Summit of November 2019. The white paper provides a description of the presentations and discussions of the summit along with recommendations for OGC activities to advance the field of Geospatial Data Science.
ii. Keywords
The following are keywords to be used by search engines and document catalogues.
ogcdoc, OGC document, Data Science, Analytics, Statistics, Artificial Intelligence, Machine Learning, Edge Computing, Knowledge-based Models, Data Management, IT Ethics, Heterogeneous computing
iii. Preface
Geospatial Data Science is defined in this white paper as “The art and craft of people leveraging technology to create value out of data using location and time.” The components of geospatial data science are data, tools, applications, ethics, and emerging trends. The data component is composed of discussions about big geospatial data; data scientists, teams and process; and data management. The tools component is composed of discussions about geospatial representations and analytics, the application of machine learning to geospatial, and knowledge-based models to support decision making.
An objective of the white paper is to serve as a basis for the promotion of geospatial data science within and external to OGC. OGC has a role to conduct activities that will advance innovation and standardization in geospatial data science. The overall objective is to enable beneficial use of geospatial information in humanity's critical decisions.
iv. Submitting organizations
This document is prepared from material provided by organizations that planned and/or presented in the summit: AIST, Topio Networks, AWS, Orion Systems, City of Los Angeles, CrowdAI, Defense Digital Service, ESIP Federation, Esri, European Space Agency, Google, Health Solutions Research, JCC Consulting, MAXAR, NASA, NatureServe, NGA, NVIDIA, OmniSci, Oracle, Ordnance Survey UK, Pitney Bowes, Radiant Earth, SOFWERX, The Climate Corporation, University of Virginia, University of Maryland - College Park, University of Illinois - Urbana Champaign, University of Iowa, US Bureau of Labor Statistics, University of Southern California, and US Department of Transportation.
A full listing of organizations that participated in the Location Powers: Data Science Summit is in Annex A.
v. Submitters
All questions regarding this document should be directed to the editor: George Percivall, Open Geospatial Consortium
1. Overview of White Paper
Geospatial Data Science has been identified as an important technology development trend by the Open Geospatial Consortium (OGC). The OGC Technology Forecasting activity began focusing on data science as an outcome of the development of the Big Geospatial Data topic area. Both Big Data and Data Science have been topics in recent Location Powers Summits.
The Location Powers: Data Science Summit (LP_DS), organized by OGC, was held on November 13 and 14, 2019, hosted by Google in Mountain View, CA. This Geospatial Data Science White Paper captures the content of the Summit and provides a basis for further action in OGC and beyond.
Location Powers Summits bring together industry, research, and government experts from across the globe into an interactive discussion that assesses the current situation and produces recommendations for future technology innovations and standards development. The Location Powers Summits are key to the technology innovation promoted by the OGC.
The Location Powers: Data Science Summit convened experts on data science, machine learning, artificial intelligence, cloud computing, remote sensing and GIS to assess the current situation of geospatial data science. Participation by leaders in social sciences, business development, government policy, and information technology led to recommendations with meaningful outcomes for geospatial data science development.
The LP_DS Summit considered the explosive availability of data about nearly every aspect of human activity, along with the revolutionary advances in computing technologies that are transforming geospatial data science. The shift from a data-scarce to a data-rich environment comes from mobile devices, remote sensing, and the Internet of Things. Nearly all of this data has components of location and time. Innovations in cloud computing and big data provide methods to perform data analytics at exceedingly large scale and speed. The development of intelligent systems using knowledge models, and their impact on our insights and understanding, was the focus of the LP_DS.
A summary of the topics discussed in the LP_DS is shown in the figure below.
This White Paper is organized as follows:
- Data Topics
  - Big Geospatial Data (Clause 3)
  - Data Scientists, Teams, Process (Clause 4)
  - Data Management (Clause 5)
- Tools
  - Geospatial Representations and Analytics (Clause 6)
  - AI and Machine Learning (Clause 7)
  - Models and Decisions (Clause 8)
- Data Science Applications and Ethics (Clause 9)
- Emerging Trends (Clause 10)
The Emerging Trends are: Edge Computing and Heterogeneous Computing
Annex A provides information about the Summit, including the agenda and the organizations that participated.
2. Overview of Geospatial Data Science
This definition was developed and repeated in several presentations and discussion sessions of the Location Powers Data Science Summit (LP_DS):
Geospatial Data Science is “The art and craft of people leveraging technology to create value out of data using location and time.”
To set the context for LP_DS, a definition for Data Science in the context of Big Data systems coming from NIST was considered. The NIST Big Data Interoperability Framework defines Data Science as the extraction of useful knowledge directly from data through a process of discovery, or of hypothesis formulation and hypothesis testing. The NIST document goes on to identify Data Science Sub-disciplines as 1) Mathematical and computer science foundations in statistics and machine learning; along with 2) Software and systems engineering methods to handle large data volumes and innovative query and analytics techniques; and, in some extended definitions, may include 3) domain data and processes.
Applying Data Science in the context of Geospatial Information is producing tremendous results. Geospatial information is experiencing the data explosion of mobile devices, remote sensing, and the Internet of Things perhaps more than other fields as all of these data types include location, spatial, and temporal information.
The Location Powers: Data Science Summit expanded beyond the topics listed above leading to this outline of key topics in Geospatial Data Science: Data, Tools, Applications, and Trends.
- Data: It is obvious, but important, to state that Data is a core topic of data science. The increasing availability of data has made new analyses possible. Geospatial Data, which has always been big data, provides opportunities for analytics in data science. Therefore, the opening discussion of data is about Big Geospatial Data (Clause 3). For data science to be effective, data scientists need to work in multi-disciplinary teams with an agile process. These topics are addressed in Data Scientists, Teams, Process (Clause 4). Managing big data requires addressing data policy along with the ecosystems and platforms to manage the data. Cloud-Native data management is providing nimble and novel methods to work with big data. These topics are addressed in Data Management (Clause 5).
- Tools: Working with Big Data requires appropriate tools. As geospatial has always been big data, many of the geospatial analysis methods were data science before the term was introduced. Methods long familiar to the geospatial community, along with extensions to those methods, are addressed in the clause on Geospatial Representations and Analytics (Clause 6). The third wave of Artificial Intelligence has been led by machine learning methods such as convolutional neural networks. The application of machine learning to big geo data, in particular imagery, is addressed in AI and Machine Learning (Clause 7). Knowledge-based data science depends upon models that are predictive of some portion of the geospatial world. Spatial decision support is supported by knowledge-based models. These topics are addressed in the last tools clause, Models and Decisions (Clause 8).
- Applications and Ethics: Applying Data Science to geospatial data is producing results, and the Summit discussed nearly a dozen application areas. The applications discussion surfaced the need to consider ethics regarding Data and Algorithms. These topics are addressed in Data Science Applications and Ethics (Clause 9).
- Trends: Trends that look set to further advance geospatial data science include Computing at the Edge and Heterogeneous Computing. Each of these is addressed in Emerging Trends (Clause 10).
3. Data: Big Geospatial Data
The emergence of Data Science concepts and motivation can be traced to Jim Gray’s concepts in "The Fourth Paradigm: Data-Intensive Scientific Discovery," by Tony Hey, Stewart Tansley, and Kristin Tolle. This book surveys opportunities and challenges for data-intensive science to prepare for the data deluge of a “sensors everywhere” data infrastructure supporting a fourth paradigm of scientific research based on “Data Exploration.” A recurring theme in the Location Powers: Data Science Summit was that of "telling stories with data." Using stories to explore and understand the data from a domain results in insights not previously available. Data Science can be described as the exploration of big data about a domain.
This Clause addresses topics related to big data for data science.
- Big Data with Location
- Big Data Software Stack
- Big Geo Data Use Cases
- Recommendations
3.1. Big Data with Location
That geospatial data has always been big data was a theme of two Location Powers: Big Data summits and of the resulting Big Geospatial Data – an OGC White Paper. The Big Geo Data white paper had these main themes:
- Geospatial data is increasing in volume and variety;
- New Big Data computing techniques are being applied to geospatial data;
- Geospatial Big Data techniques benefit many applications; and
- Open standards are needed for interoperability, efficiency, innovation and cost effectiveness.
The growth of geospatial data highlighted in the Big Geo Data White Paper continues and is accelerating. Patrick Griffiths, ESA, highlighted this trend during LP_DS. The ESA archives alone will be over 100 Petabytes by 2026.
Marc Armstrong, University of Iowa, described at LP_DS the future satellite constellations being planned by different companies, including Amazon and SpaceX. SpaceX is planning to deploy 12,000 satellites for communications, military, and scientific purposes. The revisit rate for viewing locations will increase dramatically. BlackSky is proposing 40 to 70 revisits each day. In addition to static imagery, a large amount of streaming video will be provided as well.
The Big Geo Data revolution is not only driven by remote sensing from satellites. Philippe Cases, Topio Networks, provided estimates to LP_DS on the magnitude of the data deluge coming from edge devices. All of this Edge Data has components of location and time that can be exploited in data science.
It is important to emphasize that this growing data has components of location and time. During LP_DS, Ed Parsons, Google, emphasized the ubiquity of location by introducing the definition of "ambient location."
Ambient Location (adjective): denoting or relating to a knowledge of a location that is continuously accessible. Example: "A smartphone provides the user with an ambient location service."
3.2. Big Data Software Stack
The growth of a Big Data Stack drove development of a fundamentally different software computing platform. The birth of the Big Data Stack in the late 1990s and early 2000s provided extreme flexibility and scalability in distributed batch applications for data at ever-increasing volumes. The Modern Data Architecture provides a good summary of these developments and includes this figure.
At the core of the big data stack was Apache Hadoop, which started in 2006 as a spin-off from Apache Nutch, a web crawler that stemmed from Apache Lucene, the famous open source search engine. The inspiration for this project came from the Google File System and a distributed processing framework called MapReduce. These two components combined the extreme flexibility and scalability necessary to develop distributed batch applications in a simple way.
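The MapReduce pattern mentioned above can be illustrated with a minimal, single-process sketch. This is not Hadoop code and uses no Hadoop API. It only imitates the map, shuffle, and reduce phases in plain Python, applied to a geospatial task chosen here for illustration: counting points per grid cell. The function names, the one-degree cell size, and the sample coordinates are all assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees
Cell = Tuple[int, int]       # integer grid-cell index (hypothetical scheme)


def map_phase(points: Iterable[Point], cell_size_deg: float = 1.0) -> Iterator[Tuple[Cell, int]]:
    """Map: emit one (cell, 1) pair per input point."""
    for lon, lat in points:
        cell = (math.floor(lon / cell_size_deg), math.floor(lat / cell_size_deg))
        yield cell, 1


def shuffle(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, List[int]]:
    """Shuffle: group emitted values by key, as a framework would do between phases."""
    groups: Dict[Cell, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[Cell, List[int]]) -> Dict[Cell, int]:
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_points_per_cell(points: Iterable[Point], cell_size_deg: float = 1.0) -> Dict[Cell, int]:
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(points, cell_size_deg)))


if __name__ == "__main__":
    # Illustrative sample coordinates, not data from the Summit.
    sample = [(-122.08, 37.39), (-122.41, 37.77), (-0.13, 51.51), (-122.9, 37.1)]
    print(count_points_per_cell(sample))
```

In a distributed framework the map and reduce functions run on many machines and the shuffle moves data between them. The sketch keeps the same separation of concerns so that each phase can be read independently.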
The use of Big Data Stack software for geospatial applications has been the theme of the Geospatial Track at the annual Apache Conference. The Apache Software Foundation has been a focal point for development of packages of the big data stack. These big data software packages have been extended with geospatial functionality and presented in the ApacheCon geospatial track. The following were presented in the ApacheCon 2019 Geospatial Track: GeoSpark, built on Apache Spark; the Apache Science Data Analytics Platform; GeoMesa on top of Accumulo, HBase, and Cassandra; geospatial indexing and search at scale with Apache Lucene; and real-time geospatial analytics with GPUs, RAPIDS, and Apache Arrow.
In later clauses of this white paper we will see how the Big Data Stack is important to data management (Clause 5), geospatial analytics (Clause 6), and Machine Learning (Clause 7).
3.3. Big Geo Data Use Cases
Milind Naphade, NVIDIA Metropolis, picked up on the LP_DS theme of big geo data by discussing spatial intelligence. Exploiting this growth in data will require both cloud computing and Computing at the Edge (see Clause 10 for more on this emerging trend). Both the volume of these data and the rate at which they arrive require pushing the processing closer to the source, at the edge. This will affect how many vertical applications obtain situational awareness.
The Big Geospatial Data – an OGC White Paper presented a set of use cases that apply across the application domains. The Use Cases were organized into four groups as shown in the figure. The use cases to the right of the figure provide a motivation for Geospatial Data Science.
3.4. Recommendations
This Clause motivates several recommendations.
- Plan for the continued growth of Big Geo Data;
- Continue to work with the broad Big Data Stack to make geospatial data a routine data type for the broadest communities and to make the Big Data Stack extensible to complex analysis based on spatio-temporal analytics;
- Identify common geospatial Data Science use cases that can be reused across applications; and
- Promote geospatial big data developments in the Geospatial Track of ApacheCon. The geospatial track is chaired by OGC.
These recommendations are offered for uptake in OGC’s Big Data Domain Working Group.
4. Data: Data Scientists, Teams, Process
It is obvious, but important, to state that data management and processes are core topics of data science. For data science to be effective, data scientists need to work in multi-disciplinary teams with an agile process.
This Clause addresses topics related to people and process for data science.
- Data Scientists and multi-disciplinary teams
- The role of tools: human augmentation
- Data Science Process
- Training and Institutionalization
- Recommendations
4.1. Data Scientists and Multi-Disciplinary Teams
To be effective, Data Scientists must work with multi-disciplinary teams. The teams need to include individuals who are experts in the domain of study or application. Data Scientists cannot be effective by applying generic data science tools without tuning, interpretation, and guidance from a team that provides broad understanding of the domain area. Domain context is needed so that the tools are used accurately.
Several quotes are representative of the LP_DS discussions:
- “I teach my data scientists how to work on interdisciplinary teams” – Jeanne Holm, UCLA
- “I want them to be respectful and understanding of the scope that storytellers bring, what geospatial experts bring, what policymakers bring”
- “Going in as a data-scientist-with-the-answers was often counterproductive” - Regan Smyth, NatureServe
- “Running models but not understanding the drivers and whether data science can help can be misleading” - discussion group
- "Data scientists don’t learn to deploy and scale" - underlining that their expertise alone is insufficient to solve all challenges facing companies and researchers today (such as cloud engineering).
The role of geospatial experts in the team was discussed as represented by these quotes:
- “Geospatial analysts are domain experts; data scientists tend to sit more horizontal” – Devaki Raj, CrowdAI
- “You don’t see data scientists necessarily learning geospatial technology but you expect your spatial technologists to learn data science” - discussion group
- “Data science tools become more flexible for people that have domain expertise” - discussion group
4.2. The Role of Tools: Human Augmentation
Without tools, data science would not be possible. But tools and the results they produce without human interpretation are not useful, or worse, they can be misleading. Tools that augment human intelligence are the most effective.
Several quotes are representative of these LP_DS discussions:
- “We can’t tensorflow our way out of this problem” – Andy Brooks
- “The r-squared is really high, but it’s garbage” – discussion group
- “Opportunity for commercial vendors to implement for it to become routine. User interface need not change with new stuff going on under the hood.” - Marc Armstrong
- “SWAT team of nerds” - Megan Furman
4.3. Data Science Process
Several approaches to Data Science process definition or methodology were presented and discussed at LP_DS.
Stephanie Shipp, University of Virginia, presented their Data Science Framework. The UVA framework emphasizes working with the project sponsors to identify the problem as the sponsors define it. Beginning with those discussions, along with reviewing the literature and talking to experts, sharpens the focus. The data discovery aspect of the framework is then preconditioned by the problem identification: you do not just start with the data that is readily accessible. From there the process resembles other data science frameworks, with data wrangling and profiling of the data to assess its quality. This is iterative work, and data wrangling takes about 80% of the time. Reducing that load would leave more time for statistical modeling and analyses.
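The data profiling step can be made concrete with a small sketch. The following is not part of the UVA framework. It is a minimal illustration, assuming records held as Python dictionaries with hypothetical field names (`lon`, `lat`, `value`), of two simple quality checks that are often run before modeling: counting missing values per field and flagging coordinates outside the valid longitude and latitude ranges.

```python
from typing import Any, Dict, List, Optional, Sequence

Record = Dict[str, Optional[Any]]


def profile_records(records: Sequence[Record], fields: Sequence[str]) -> Dict[str, Any]:
    """Summarize basic data quality: missing values per field and invalid coordinates.

    The field names "lon" and "lat" are assumptions made for this example.
    """
    missing = {field: 0 for field in fields}
    out_of_range = 0

    for record in records:
        # Count fields that are absent or explicitly None.
        for field in fields:
            if record.get(field) is None:
                missing[field] += 1

        # Check coordinate ranges only when both values are present.
        lon, lat = record.get("lon"), record.get("lat")
        if lon is not None and lat is not None:
            if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
                out_of_range += 1

    return {
        "total": len(records),
        "missing": missing,
        "out_of_range_coordinates": out_of_range,
    }


if __name__ == "__main__":
    # Illustrative records, not data from the Summit.
    sample: List[Record] = [
        {"id": 1, "lon": -78.5, "lat": 38.03, "value": 12.0},
        {"id": 2, "lon": None, "lat": 38.1, "value": 7.5},
        {"id": 3, "lon": -78.4, "lat": 95.0, "value": None},
    ]
    print(profile_records(sample, ["lon", "lat", "value"]))
```

A report like this is typically regenerated after each wrangling pass, which reflects the iterative nature of the work described above.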