Publication Date: 2020-02-12

Approval Date: 2020-02-12

Submission Date: 2019-12-20

Reference number of this document: OGC 19-015

Reference URL for this document:

Category: OGC Public Engineering Report

Editor: Stephane Fellah

Title: OGC Testbed-15: Federated Cloud Provenance ER

OGC Public Engineering Report


Copyright © 2020 Open Geospatial Consortium. To obtain additional rights of use, visit


This document is not an OGC Standard. This document is an OGC Public Engineering Report created as a deliverable in an OGC Interoperability Initiative and is not an official position of the OGC membership. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an OGC Standard. Further, any OGC Public Engineering Report should not be referenced as required or mandatory technology in procurements. However, the discussions in this document could very well lead to the definition of an OGC Standard.


Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.

If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.


This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.

Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications.

This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.

None of the Intellectual Property or underlying information or technology may be downloaded or otherwise exported or reexported in violation of U.S. export laws and regulations. In addition, you are responsible for complying with any local laws in your jurisdiction which may impact your right to import, export or use the Intellectual Property, and you represent that you have complied with any regulations or registration procedures required by applicable law to make this license enforceable.

Table of Contents

1. Subject

The emergence of Federated Cloud processing and ‘Big Data’ has raised many concerns over the uses to which data is being put. This has led to new requirements for methodologies and capabilities that can address transparency and trust in data provenance in the Cloud. Distributed Ledger Technologies (DLTs), and more specifically blockchains, have been proposed as a possible platform for addressing provenance. This OGC Testbed-15 Engineering Report (ER) is a study of the application of DLTs to managing provenance information in Federated Clouds.

2. Executive Summary

Cloud computing has been widely adopted by the commercial, research, and military communities. To support "on-demand" and "pay-as-you-go" computing models, cloud computing extends distributed and parallel system architecture by using abstraction and virtualization techniques. These environments are composed of heterogeneous hardware and software components from different vendors, on which complex workflows can be executed through federated orchestration.

Assurance of the quality and repeatability of data results is essential in many fields (eScience and healthcare, for example) and requires cloud auditing and the maintenance of provenance information for the whole workflow execution. The use of heterogeneous components in cloud computing environments introduces risks of accidental data corruption and processing errors, as well as vulnerabilities such as security violations, data tampering, and malicious forgery of provenance. Cloud systems are structured in a fundamentally different way from other distributed systems, such as grids, and therefore present new challenges for the collection of provenance data.

Current scientific workflows do not provide a standard way to share provenance. Existing workflow management systems that integrate provenance repositories are typically proprietary and were not designed with interoperability with other systems in mind. Federated Cloud architectures exacerbate the challenge of tracking and sharing provenance information.

Sharing provenance from scientific workflows would enable the rapid reproduction of results, as well as the rapid computation of new and significant results by using the recorded history to generate new workflow definitions through minor modifications of the original workflow. The ability to share provenance would greatly reduce duplication of workflows, improve the trust and integrity of data and analyses, improve the reproducibility of scientific workflows, and catalyze the discovery of new knowledge. While these goals may be relatively simple to achieve in a single, well-designed workflow management system that captures provenance, there is no readily available general-purpose solution, especially for cloud-based environments.

The scope of this study is to review the state of the art of provenance and blockchain technologies and to identify the challenges and requirements for managing cloud computing provenance on a blockchain. Based on these analyses, the authors of this ER propose an architecture for sharing provenance information from federated cloud workflows that ensures the provenance information has not been tampered with, so that users can trust the results produced by the workflow. This study is not about defining a model of provenance to reproduce workflows, though the authors indicate some good candidates to address that challenge.

The findings of the study determine that W3C Self-Sovereign Identities (SSIs) and Verifiable Credentials are fundamental assets for interaction over the Internet and are the cornerstone of establishing the Web of Trust needed to ensure the provenance of information. SSI returns full control of an identity to its owner, and the use of DLTs and blockchains to support decentralized PKI provides a solid alternative that addresses the usability and security issues of the centralized PKI approach. SSIs and Verifiable Credentials are still young technologies, but the development of these standards is moving at a rapid pace and will have a profound impact on current web technologies, leading toward Web 4.0.

2.1. Document contributor contact points

All questions regarding this document should be directed to the editor or the contributors:


Name Organization Role

Stephane Fellah

Image Matters LLC


Anna Burzykowska

European Space Agency


2.2. Foreword

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.

Recipients of this document are requested to submit, with their comments, notification of any relevant patent claims or other intellectual property rights of which they may be aware that might be infringed by any implementation of the standard set forth in this document, and to provide supporting documentation.

3. References

The following normative documents are referenced in this document.

4. Terms and definitions

For the purposes of this report, the definitions specified in Clause 4 of the OWS Common Implementation Standard OGC 06-121r9 shall apply. In addition, the following terms and definitions apply.

● 51% Attack

When more than half of the computing power of a cryptocurrency network is controlled by a single entity or group, this entity or group may issue conflicting transactions to harm the network, should they have the malicious intent to do so.

● Address

Addresses (Cryptocurrency addresses) are used to receive and send transactions on the network. An address is a string of alphanumeric characters, but can also be represented as a scannable QR code.

● Blockchain

A particular type of data structure used in some distributed ledgers, which stores and transmits data in packages called ‘blocks’, connected to each other in a digital ‘chain’. Blockchains employ cryptographic and algorithmic methods to record and synchronise data across a network in an immutable manner.
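The hash-linking described above can be illustrated with a minimal sketch in plain Python. Field names and the chain layout are illustrative only, not any particular blockchain's format:

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def make_block(data: str, prev_hash: str) -> dict:
    return {"data": data, "prev_hash": prev_hash}

def chain_is_valid(chain: list) -> bool:
    """Each block must reference the hash of its predecessor."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != block_hash(prev):
            return False
    return True

# Build a three-block chain starting from a genesis block.
genesis = make_block("genesis", prev_hash="0" * 64)
b1 = make_block("tx: A pays B", block_hash(genesis))
b2 = make_block("tx: B pays C", block_hash(b1))
chain = [genesis, b1, b2]

print(chain_is_valid(chain))   # True
b1["data"] = "tx: A pays C"    # tamper with an earlier block
print(chain_is_valid(chain))   # False: b2 no longer references b1's hash
```

Altering any earlier block changes its hash, breaking every later link; this is the sense in which the chain records data "in an immutable manner".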

● Blockchain transaction

A Blockchain transaction can be defined as a small unit of task, stored in public records. These records are also known as ‘blocks’. These blocks are executed, implemented and stored in blockchain only after validation by the entities in the blockchain network.

● Central Ledger

A central ledger refers to a ledger maintained by a central agency.

● Claim

A statement about an identity. This could be: a fact, such as a person’s age; an opinion, such as a rating of their trustworthiness; or something in between, such as an assessment of a skill.

● Confirmation

A confirmation means that the blockchain transaction has been verified by the network. This happens through a process known as mining, in a proof-of-work system (e.g. Bitcoin). Once a transaction is confirmed, it cannot be reversed or double spent. The more confirmations a transaction has, the harder it becomes to perform a double spend attack.

● Consensus

Consensus is achieved when all participants of the network agree on the validity of the transactions, ensuring that the ledgers are exact copies of each other.

● Credential

A set of one or more claims made by an issuer. A verifiable credential is a tamper-evident credential that has authorship that can be cryptographically verified. Verifiable credentials can be used to build verifiable presentations, which can also be cryptographically verified. The claims in a credential can be about different subjects.
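As an illustration of the shape a verifiable credential takes in the W3C Verifiable Credentials data model, the following sketch uses hypothetical DIDs and an elided signature; in practice the `proof` block is produced by a cryptographic signing suite, not written by hand:

```python
# Sketch of a W3C Verifiable Credential (illustrative values only).
credential = {
    "@context": ["https://www.w3.org/2018/credentials/v1"],
    "type": ["VerifiableCredential"],
    "issuer": "did:example:issuer123",          # hypothetical issuer DID
    "issuanceDate": "2020-01-01T00:00:00Z",
    "credentialSubject": {
        "id": "did:example:subject456",         # hypothetical subject DID
        "degree": "Master of Science"           # one claim about the subject
    },
    "proof": {                                  # tamper-evidence: signature over the claims
        "type": "Ed25519Signature2018",
        "verificationMethod": "did:example:issuer123#key-1",
        "jws": "..."                            # elided signature value
    },
}
```

A verifier checks the `proof` against the issuer's public key (resolved via the DID) to confirm authorship and detect tampering.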

● Cryptocurrency

A form of digital currency based on mathematics, where encryption techniques are used to regulate the generation of units of currency and verify the transfer of funds. Furthermore, cryptocurrencies operate independently of a central bank.

● Cryptography

A method for securing communication using code. The main example of cryptography in cryptocurrency is the public-key cryptography used in the Bitcoin network. Bitcoin addresses generated for the wallet have matching private keys that allow for the spending of the cryptocurrency. The corresponding public key coupled with the private key allows funds to be unlocked. This is one example of cryptography in action.

● Decentralization

The transfer of authority and responsibility from a centralized organization, government, or party to a distributed network.

● Decentralized Application (DApp)

DApp is a decentralized application, running on a decentralized peer-to-peer network as opposed to running on centralized servers.

● Decentralized Identifier (DID)

A globally unique identifier that does not require a centralized registration authority because it is registered with distributed ledger technology (DLT) or other form of decentralized network.

● Digital currencies

Digital currencies are digital representations of value, denominated in their own unit of account. They are distinct from e-money, which is a digital payment mechanism, representing and denominated in fiat money.

● Digital Identity

A digital identity is an online or networked identity adopted or claimed in cyberspace by an individual, organization, or electronic device.

● Distributed Ledger Technology (DLT)

DLT refers to a novel and fast-evolving approach to recording and sharing data across multiple data stores (or ledgers). This technology allows for transactions and data to be recorded, shared, and synchronized across a distributed network of different network participants.

● Ethereum

Ethereum is the open-source, public, blockchain-based distributed computing platform and operating system, featuring smart contract functionality.


The Ethereum Virtual Machine (EVM) is a Turing complete virtual machine that allows anyone to execute arbitrary EVM Byte Code. Every Ethereum node runs on the EVM to maintain consensus across the blockchain.

● Fork

A fork creates an alternative version of a blockchain. The two chains run simultaneously on different parts of the network. They can be either accidental or intentional.

● Genesis Block

The very first block in a block chain.

● Identity Provider

An identity provider, sometimes abbreviated as IdP, is a system for creating, maintaining, and managing identity information for holders, while providing authentication services to relying party applications within a federation or distributed network. In this case the holder is always the subject. Even if the verifiable credentials are bearer credentials, it is assumed the verifiable credentials remain with the subject, and if they are not, they were stolen by an attacker. This specification does not use this term unless comparing or mapping the concepts in this document to other specifications. This specification decouples the identity provider concept into two distinct concepts: the issuer and the holder.

● Immutable

An inability to be altered or changed over time. This refers to a ledger’s inability to be changed by a single administrator: data, once written onto a blockchain, cannot be altered.

● Internet of Things (IoT)

Internet of Things is a network of objects, linked by a tag or microchip, that send data to a system that receives it.

● InterPlanetary File System (IPFS)

A peer-to-peer distribution protocol that started as an open source project at Interplanetary Networks. This method of storing and sharing hypermedia in a distributed file system aims to help applications run faster, safer, and more transparently. IPFS allows objects to be exchanged without a single point of failure and enables nodes to interact without needing to trust one another.

● Ledger

An append-only record store, where records are immutable and may hold more general information than financial records.

● Mining

Mining is the act of validating blockchain transactions. The necessity of validation warrants an incentive for the miners, usually in the form of coins.

● Multi Signature

Multi-signature (multisig) addresses allow multiple parties to require more than one key to authorize a transaction. The needed number of signatures is agreed at the creation of the address. Multi signature addresses have a much greater resistance to theft.

● Node (Full Node)

A computer connected to the blockchain network is referred to as a ‘node’. A full node is a program that can fully validate transactions and blocks, bolstering the P2P network. Most nodes are not full nodes, as full nodes can be difficult to run due to their bulky size.

● Oracle

An oracle connects the real world and the blockchain: it finds and verifies real-world events and supplies this information to smart contracts on the blockchain.

● Participant

An actor who can access the ledger: reading records from it or adding records to it.

● Peer

An actor that shares responsibility for maintaining the identity and integrity of the ledger.

● Peer to Peer (P2P)

Peer-to-peer (P2P) refers to the decentralized interactions that happen between at least two parties in a highly interconnected network. P2P participants deal directly with each other rather than through a single mediation point.

● Permissioned Ledger

A permissioned ledger is a ledger where actors must have permission to access the ledger. Permissioned ledgers may have one or many owners. When a new record is added, the ledger’s integrity is checked by a limited consensus process. This is carried out by trusted actors — government departments or banks, for example — which makes maintaining a shared record much simpler than the consensus process used by unpermissioned ledgers. Permissioned blockchains provide highly verifiable data sets because the consensus process creates a digital signature, which can be seen by all parties. A permissioned ledger is usually faster than an unpermissioned ledger.

● PoS/Pow Hybrid

A combination of Proof of Stake (PoS) and Proof of Work (PoW) consensus protocols on a blockchain network. Blocks are validated not only by miners but also by voters (stakeholders), to form a balanced network governance.

● Private Blockchain

A closed network where blockchain permissions are held and controlled by a centralized entity. Read permissions are subject to varying levels of restriction.

● Public Address

A public address is the cryptographic hash of a public key. Public addresses act like email addresses: they can be published anywhere, unlike private keys.

● Private Key

A private key is a string of data that allows you to access the tokens in a specific wallet. Private keys act like passwords: they are kept hidden from anyone but the owner of the address.

● Proof-of-Authority

A consensus mechanism in a private blockchain that grants a single private key the authority to generate all of the blocks.

● Proof of Stake

A consensus distribution algorithm that rewards earnings based on the number of coins you own or hold. The more you invest in the coin, the more you gain by mining with this protocol.

● Proof of Work

A consensus distribution algorithm that requires an active role in mining data blocks, often consuming resources, such as electricity. The more ‘work’ you do or the more computational power you provide, the more coins you are rewarded with.
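The "work" in proof of work can be sketched as a brute-force search for a nonce whose hash meets a difficulty target. This is illustrative only; real networks use far higher difficulty and richer block headers:

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Search for a nonce such that SHA-256(data + nonce) starts with
    `difficulty` hex zeros. Expected work grows ~16x per extra zero."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

nonce = mine("example block", difficulty=3)
digest = hashlib.sha256(f"example block{nonce}".encode()).hexdigest()
print(nonce, digest[:8])
```

Note the asymmetry that consensus relies on: finding the nonce takes many hash evaluations, while any peer can verify it with a single hash.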

● Protocol

A set of rules that dictate how data is exchanged and transmitted. This pertains to cryptocurrency in blockchain when referring to the formal rules that outline how these actions are performed across a specific network.

● Public Blockchain

A globally public network where anyone can participate in transactions, execute the consensus protocol to help determine which blocks get added to the chain, and help maintain the shared ledger.

● Public Key Cryptography

Public Key Cryptography is an asymmetric encryption scheme that uses two sets of keys: a public key that is widely disseminated, and a private key known only to the owner. Public key cryptography can be used to create digital signatures, and is used in a wide array of applications, such as HTTPS internet protocol, for authentication in critical applications, and also in chip-based payment cards.

● SHA-256

SHA-256 is a cryptographic hash function used by cryptocurrencies such as Bitcoin. However, mining with it consumes a lot of computing power and processing time, forcing miners to form mining pools to capture gains.

● Smart contract

Smart contract is a computer protocol intended to digitally facilitate, verify, or enforce the negotiation or performance of a contract. In this paper ‘smart contract’ is mostly used in the sense of general purpose computation that takes place on a blockchain or distributed ledger. In this interpretation, a ‘smart contract’ is not necessarily related to the classical concept of a contract, but can be any kind of computer program or code-executed task on blockchain.

● Solidity

Solidity is Ethereum’s programming language for developing smart contracts.

● Token

A Token is a representation of a digital asset. It typically does not have intrinsic value but is linked to an underlying asset, which could be anything of value.

● Unpermissioned ledgers

Unpermissioned ledgers such as Bitcoin have no single owner — indeed, they cannot be owned. The purpose of an unpermissioned ledger is to allow anyone to contribute data to the ledger and for everyone in possession of the ledger to have identical copies. This creates censorship resistance, which means that no actor can prevent a transaction from being added to the ledger. Participants maintain the integrity of the ledger by reaching a consensus about its state.

● Verifiable Claim

A Verifiable Claim is machine-readable information that can be verified by a third party on the Web. Such a claim is effectively tamper-proof and its authorship can be cryptographically verified. Multiple claims may be bundled together into a set of claims.

4.1. Abbreviated terms

  • API Application Programming Interface

  • CAS Content Addressable Storage

  • DAO Decentralized Autonomous Organization

  • DID Decentralized Identifier

  • DIF Decentralized Identity Foundation

  • DPKI Decentralized Public Key Infrastructure

  • DLT Distributed Ledger Technology

  • DNS Domain Name System

  • ERC Ethereum Request for Comments

  • ETH Ethereum

  • FIM Federated Identity Management

  • HTTP Hyper-Text Transfer Protocol

  • IDMS Identity Management System

  • IP Internet Protocol

  • IPFS Inter-Planetary File System

  • ISO International Organization for Standardization

  • JSON JavaScript Object Notation

  • JSON-LD JavaScript Object Notation for Linked Data

  • OGC Open Geospatial Consortium

  • PII Personally-Identifiable Information

  • SSI Self-Sovereign Identity

  • URI Uniform Resource Identifier

  • URL Uniform Resource Locator

  • W3C World Wide Web Consortium

  • ZK Zero-Knowledge

  • ZKP Zero-Knowledge Protocol

  • XML eXtensible Markup Language

5. Overview

Until recently, data analysts designed their algorithms with the assumption that the analyzed data were gathered into a centralized repository, such as a cloud data center or a data lake (a storage repository that holds a vast amount of raw data in its native format until it is needed). A paradigm shift is now occurring with the exponential growth of Big Data due to the rise of the Internet of Things (IoT), social media, mobility, and other data sources. This growth defies the scalability of centralized approaches to storing and analyzing data in a single location. For example, some sensors generate and store data locally, as moving the data to a centralized location is impractical due to bandwidth and cost constraints. In other cases, data centralization may not be possible due to security concerns for data in transit, or governance, privacy, or regulatory compliance issues that limit the movement of data beyond certain geographic boundaries.

Analytics in a centralized repository is becoming challenging as data becomes more and more distributed. If data cannot be brought together for analysis, then analytics must be taken to the data. This often occurs in controlled, well-defined, and well-secured places at the edge, in the fog or core, and in the cloud or enterprise data centers. Further, intermediate results may need to be fused and analyzed together as well. This is where federated cloud and federated analytics enter the picture. There are a number of challenges to overcome:

  • How to redesign the data analytic algorithms to reason and learn in a federated manner?

  • How to distribute the analytics close to where the data is collected?

  • How to aggregate and analyze together the intermediate results to drive higher-order learning at scale?

  • How to ensure the quality and repeatability of the data results of distributed workflows in a federated environment?

  • How to audit the execution of distributed workflows in a federated environment to analyze the validity of an analysis or identify faults in execution?

  • How to verify the integrity of all the participants and all the data sources in the analytics process? In particular, how to get the same assurances of trust, transparency and traceability (the “Three Ts” of data analytics) that you would have in a centralized world?

How do you get there? One answer is to combine the unique capabilities of federated analytics and blockchain technology, which adds a distributed ledger to the federated analytics solution. Identity Management and Provenance information play a central role in the solution. This is the topic of the study documented in this OGC ER.

Sections 6 and 7 provide background information and related work on provenance and workflow-related technologies, in particular in a federated cloud environment.

Section 8 provides background information about distributed ledger technologies, including blockchain. This section also describes their characteristics and benefits.

Section 9 provides a detailed analysis of the management of identity, which is a fundamental asset of interaction, and demonstrates how DLTs and blockchain technologies can solve many of the fundamental usability and security issues currently encountered in existing identity management solutions.

Section 10 provides an outline of a solution that addresses the challenges described above.

Section 11 summarizes the conclusions of this study and outlines future work to be investigated.

6. Provenance

6.1. Provenance vocabulary standards

Provenance provides vital information for evaluating the quality and trustworthiness of information on the Web. Therefore, semantically interchangeable provenance information must be accessible, and there must be agreement on where and how this information is to be located [1]. The mission statement of the W3C Provenance Incubator Group stated that the provenance of information is critical to making determinations about whether information is trusted, how to integrate diverse information sources, and how to give credit to originators when reusing information. Broadly defined, provenance encompasses the initial sources of information used as well as any entity and process involved in producing a result. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgements based on provenance that may or may not be explicitly offered to them.

Provenance is a well-established term in the context of art or digital libraries, where it refers respectively to the documented history of an art object, or the documentation of processes in a digital object’s life cycle [2]. The e-science community’s interest in provenance [3] is also growing, as provenance is considered a crucial component of workflow systems [4] that can help scientists ensure the reproducibility of their scientific analyses and processes. In law, the concept of provenance refers to the "chain of custody" or the paper trail of evidence. This concept logically extends to the documentation of the history of changes of data in a knowledge system [5].

In the context of this study, the term provenance refers to information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness [6].

Around the year 2006, consensus began to emerge on the benefits of having a community-defined data model and uniform representation for “data provenance, process documentation, data derivation, and data annotation”, as stated in [7].

There are many areas of research and development that have studied relevant aspects of provenance. They can be classified into the following broad categories [8],[4],[7],[9]:

  • Interoperability for different provenance systems and tools to aid in the integration of provenance.

  • Information management infrastructure to manage growing volume of provenance data

  • Provenance analytics and visualization for mining and extracting knowledge from provenance data, which has been largely unexplored

  • Data provenance security and inference control

A series of challenges for provenance was launched [7]. The first Provenance Challenge [10] was to test the hypothesis that heterogeneous systems (mostly in the e-science/cyberinfrastructure space), each individually capable of producing provenance data by observing the execution of data-intensive processes, could successfully exchange such provenance observations with each other, without loss of information. The Open Provenance Model (OPM) [7] was proposed as a common data model for the experiment. The second challenge aimed at allowing disparate groups to gain a better understanding of the similarities, differences, core concepts, and common issues across systems. The third challenge aimed at exchanging provenance information encoded with OPM and providing additional profiles. The fourth challenge was to apply OPM to scenarios and demonstrate novel functionality that can only be achieved by the presence of an interoperable solution for provenance. Some of the approaches to address these challenges use Semantic Web technologies [11].

The notion of causal relationships, or dependencies, involving artifacts (e.g., data items), processes, and agents plays a central role in OPM. Using the OPM, one can assert that an artifact A was produced or consumed by a process P, such as “the orthoimage was produced from a Digital Elevation Model M, aerial image I, and control points C using orthorectification algorithm P.” Here M, I, and C are artifacts, and P is a process. One can also assert a derivation dependency between two artifacts, A1 and A2, without mentioning any mediating process, i.e., “A2 was derived from A1.” Agents, including humans, software systems, etc., can be mentioned in OPM as process controllers. For example, “the orthorectification process was controlled by software S managed by agent X.” OPM statements attempt to explain the existence of artifacts. Since such statements may reflect an incomplete view of the world, obtained from a specific perspective, the OPM adopts an open world assumption, whereby the statements are interpreted as correct but possibly incomplete knowledge: “A2 was derived from A1” asserts a certain derivation, but does not exclude that other, possibly unknown, artifacts, in addition to A1, may have contributed to explaining the existence of A2. Other features of the OPM, including built-in rules for inference of new provenance facts, are described in detail in [10].
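The OPM assertions in the orthoimage example can be sketched as causal-dependency triples. The identifiers (`dem_M`, `aerial_image_I`, and so on) are hypothetical, and OPM defines a data model rather than an API; this is only a minimal illustration of the kinds of statements involved:

```python
class OPMGraph:
    """A minimal store of OPM-style causal dependencies as
    (subject, relation, object) triples."""
    def __init__(self):
        self.triples = set()

    def assert_dep(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def derivations_of(self, artifact):
        """Artifacts the given artifact was (directly) derived from."""
        return {o for s, r, o in self.triples
                if s == artifact and r == "wasDerivedFrom"}

g = OPMGraph()
# "The orthoimage was produced by orthorectification P using DEM M,
#  aerial image I, and control points C; P was controlled by agent X."
for src in ("dem_M", "aerial_image_I", "control_points_C"):
    g.assert_dep("orthorectification_P", "used", src)
    g.assert_dep("orthoimage", "wasDerivedFrom", src)
g.assert_dep("orthoimage", "wasGeneratedBy", "orthorectification_P")
g.assert_dep("orthorectification_P", "wasControlledBy", "agent_X")

print(sorted(g.derivations_of("orthoimage")))
# ['aerial_image_I', 'control_points_C', 'dem_M']
```

Under the open world assumption, `derivations_of` returns only the asserted derivations; the absence of a triple never implies the absence of a dependency.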

In September 2009, the W3C Provenance Incubator Group was created. The group’s mission was to “provide a state-of-the-art understanding and develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization.” The group produced its final report in December 2010 [12]. The report highlighted the importance of provenance for multiple application domains, outlined typical scenarios that would benefit from a rich provenance description, and summarized the state of the art from the literature, as well as the Web technology available to support tools that exploit a future standard provenance model. As a result, the W3C Provenance Working Group was created in 2011. The group released its final recommendations for PROV in June 2013 [13].

The core PROV-O standard defines the following core elements (see Figure 1) [13]:

  • Entities: Physical, digital, conceptual, or other kinds of things are called entities. Examples of such entities are a web page, a chart, and a spellchecker.

  • Activities: Activities generate new entities. For example, writing a document brings the document into existence, while revising the document brings a new version into existence. Activities also make use of entities.

  • Agents: An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or other entities that may be ascribed responsibility.

Figure 1. Prov-O Core Model

One of the advantages of PROV-O is that it is based on Semantic Web standards; thus, provenance information can be interpreted by machines without ambiguity using the well-defined semantics specified by the PROV-O ontology. The model can be extended, in principle, using the standard OWL extension mechanisms (subclass, subproperty, and so forth) to address the needs of multiple disciplines. The PROV-O specification defines additional terms that extend the core concepts of the specification. Figure 2 depicts Entities as yellow ovals, Activities as blue rectangles, and Agents as orange pentagons. The domain of prov:atLocation (prov:Activity or prov:Entity or prov:Agent or prov:InstantaneousEvent) is not illustrated.
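The core pattern can be illustrated with a small sketch. The following Python fragment represents the orthoimage example from the OPM discussion above as RDF-style triples and answers a simple lineage query. Only the prov: terms come from the PROV-O vocabulary; the ex: resource names are hypothetical, and a real application would use an RDF library rather than plain tuples:

```python
# Minimal sketch of the PROV-O core pattern using plain RDF-style triples.
# The prov: terms are from the W3C PROV-O vocabulary; the resource names
# (ex:orthoimage, ex:dem, ex:orthorectification, ex:softwareS) are hypothetical.

triples = [
    ("ex:orthoimage", "rdf:type", "prov:Entity"),
    ("ex:dem", "rdf:type", "prov:Entity"),
    ("ex:orthorectification", "rdf:type", "prov:Activity"),
    ("ex:softwareS", "rdf:type", "prov:SoftwareAgent"),
    # Core relations: generation, usage, and attribution of responsibility.
    ("ex:orthoimage", "prov:wasGeneratedBy", "ex:orthorectification"),
    ("ex:orthorectification", "prov:used", "ex:dem"),
    ("ex:orthorectification", "prov:wasAssociatedWith", "ex:softwareS"),
    # Derivation between two entities, without naming the mediating process.
    ("ex:orthoimage", "prov:wasDerivedFrom", "ex:dem"),
]

def derived_from(entity, triples):
    """Return the entities a given entity was (directly) derived from."""
    return [o for s, p, o in triples
            if s == entity and p == "prov:wasDerivedFrom"]

print(derived_from("ex:orthoimage", triples))
```

Because the open world assumption applies, an empty query result means only that no derivation is recorded, not that none exists.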

Figure 2. The expanded terms build upon those in the PROV-O core model.

Most provenance storage solutions are currently based on a database controlled by a central authority. This means that if the central authority is compromised, the data provenance can also be compromised and be tampered with or destroyed. A reliable, secure, decentralized solution could be more suitable to address this issue.

6.2. Levels of Provenance and Resource Sharing

A number of studies have investigated the role of automated workflows and published best practices to support workflow design, preservation, understandability, and reuse. The recommendations and their justifications were summarized by Khan et al. [14] by studying workflows from different domains (see Table 1). Their study classifies the recommendations into broad categories related to workflow design, retrospective provenance, the computational environment required/used for an analysis, and better findability and understandability of all shared resources. The recommendations have been informed by a wide corpus of literature [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. Their findings were used to define the CWLProv model, which defines the provenance model for the Common Workflow Language (CWL).

Table 1. Summarized recommendations and justifications from the literature covering best practices on reproducibility, accessibility, interoperability, and portability of workflows [14]
  • Retrospective Provenance. Recommendation: Save and share all parameters used for each software application executed in a given workflow (including the default values of parameters used). Justification: This affects reproducibility of results because different inputs and configurations of the software can produce different results, and different versions of a tool might change the default values of the parameters.

  • Prospective Provenance. Recommendation: Avoid manual processing of data, and if using shims, make these part of the workflow to fully automate the computational process. Justification: This ensures the complete capture of the computational process without broken links, so that the analysis can be executed without the need for manual steps.

  • Data Sharing. Recommendation: Include intermediate results where possible when publishing an analysis. Justification: Intermediate data products can be used to inspect and understand a shared analysis when re-enactment is not possible.

  • Retrospective Provenance. Recommendation: Record the exact software versions used. Justification: This is necessary for reproducibility of results because different software versions can produce different results.

  • Retrospective Provenance. Recommendation: If using public data (reference data, variant databases), store and share the actual data versions used. Justification: Different versions of data, e.g., the human reference genome or variant databases, can result in slightly different results for the same workflow.

  • Prospective Provenance. Recommendation: Workflows should be well-described, annotated, and offer associated metadata. Annotations such as user-contributed tags and versions should be assigned to workflows and shared when publishing the workflows and associated results. Justification: Metadata and annotations improve the understandability of the workflow, facilitate independent reuse by someone skilled in the field, make workflows more accessible, and hence promote the longevity of the workflows.

  • Findability and Understandability. Recommendation: Use and store stable identifiers for all artifacts, including the workflow, the datasets, and the software components. Justification: Identifiers play an important role in the discovery, citation, and accessibility of resources made available in open access repositories.

  • Execution Environment. Recommendation: Share the details of the computational environment. Justification: Such details support analysis of requirements before any re-enactment or reproducibility is attempted.

  • Prospective Provenance. Recommendation: Share the workflow specifications/descriptions used in the analysis. Justification: The same workflow specifications can be used with different datasets, thereby supporting reusability.

  • Execution Environment. Recommendation: Aggregate the software with the analysis and share this when publishing a given analysis. Justification: Making software available reduces dependence on third-party resources and as a result minimizes “workflow decay”.

  • Data Sharing. Recommendation: Share the raw data used in the analysis. Justification: When someone wants to validate published results, availability of data supports verification of claims and hence establishes trust in the published analysis.

  • Retrospective Provenance. Recommendation: Store all attributions related to the data resources and software systems used. Justification: Accreditation supports proper citation of the resources used.

  • Retrospective Provenance. Recommendation: Workflows should be preserved along with the provenance trace of the data and results. Justification: A provenance trace provides a historical view of the workflow enactment, enabling end users to better understand the analysis retrospectively.

  • Prospective Provenance. Recommendation: Data flow diagrams of the computational analysis using workflows should be provided. Justification: These diagrams are easy to understand and provide a human-readable view of the workflow.

  • Findability and Understandability. Recommendation: Open source licensing for methods, software, code, workflows, and data should be adopted instead of proprietary resources. Justification: This improves availability and legal reuse of the resources used in the original analysis, while restricted licenses would hinder reproducibility.

  • Findability and Understandability. Recommendation: Data, code, and all workflow steps should be shared in a format that others can easily understand, preferably in a system-neutral language. Justification: System-neutral languages help achieve interoperability and make an analysis understandable.

  • Execution Environment. Recommendation: Promote easy execution of workflows without making significant changes to the underlying environment. Justification: In addition to helping reproducibility, this enables adapting the analysis methods to other infrastructures and improves workflow portability.

  • Execution Environment. Recommendation: Information about compute and storage resources should be stored and shared as part of the workflow. Justification: Such information can assist users in estimating the resources needed for an analysis and thereby reduce the number of failed executions.

  • Data Sharing. Recommendation: Example input and sample output data should be preserved and published along with the workflow-based analysis. Justification: This information enables more efficient test runs of an analysis to verify and understand the methods used.
These recommendations can be clustered into broad themes as shown in Figure 3.

Figure 3. Recommendations from Table 1 classified into categories.

6.3. Provenance Security

While considerable work has been done on provenance of workflows and documents, much less work has been done on securing the provenance information. Secure provenance is of paramount importance to federated cloud computing, yet it remains challenging today. Data provenance needs to be secured because it may contain sensitive/private information. Cloud service providers do not guarantee confidentiality of the data stored in dispersed geographical locations. Unless provenance information is secured and under appropriate access control policies for confidentiality and privacy, the information simply cannot be trusted [29]. In the past few years, several studies have recognized the importance of securing provenance [29], [30], [31], [32]. Lee et al. [31] provide a survey of the provenance security challenges in the cloud.

To guarantee the trustworthiness of data provenance, the data provenance scheme must satisfy the following general data security properties [32]:

  • Confidentiality: "Data provenance of a sensitive piece of data (that is, the source data) may reveal some private information. Therefore, it is necessary to encrypt not only the source data but also the data provenance. Moreover, a query to and/or a response from the data provenance store may reveal some sensitive information. Thus, both the query and its response must be encrypted in order to guarantee confidentiality on the communication channel. Last but not least, if data provenance is stored in an outsourced environment, such as the cloud, then the data provenance scheme must guarantee that neither the stored information nor the query and response mechanism reveals any sensitive information while storing data provenance or performing search operations" [32].

  • Integrity: "The data provenance is immutable. Therefore, the integrity must be ensured by preventing any kind of unauthorized modifications in order to get the trustworthy data provenance. The integrity guarantees that data provenance cannot be modified during the transmission or on the storage server without being detected" [32].

  • Unforgeability: "An adversary may forge data provenance of the existing source data with fake data. Unforgeability refers to the source data being tightly coupled with its data provenance. In other words, an adversary cannot forge the fake data with existing data provenance (or vice versa) without being detected" [32].

  • Non-Repudiation: Once a user takes an action, data provenance is generated as a consequence. A user must not be able to deny ownership of the action once the data provenance has been recorded. Non-repudiation ensures that users cannot deny actions they have taken [32].

  • Availability: "The data provenance and its corresponding source data might be critical" [32]. Therefore, the provenance of the data or the source data must be available at any time from anywhere. "For instance, the life critical data of a patient is subject to high availability, considering emergency situations that can occur at any time. The availability of the data can be ensured by a public storage service such as provided by the cloud service provider" [32] or by Content Addressable Storage (such as IPFS).

In other words, secure trustworthy provenance mechanisms ensure that provenance chains are tamper-evident, that their contents are confidential, and that auditors can verify their authenticity without having to know the contents.
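As a minimal illustration of the integrity and unforgeability properties above, the sketch below couples a piece of source data to its provenance record with a keyed hash. This is a simplified, hypothetical scheme (a real system would use digital signatures and proper key management), but it shows why tampering with either the data or the provenance is detectable:

```python
# Illustrative sketch only (not a production scheme): binding source data to
# its provenance record with an HMAC so tampering with either is detectable.
# The shared key handling shown here is deliberately simplified.
import hashlib
import hmac
import json

def seal(key: bytes, data: bytes, provenance: dict) -> str:
    """Couple data and provenance with a keyed hash (unforgeability)."""
    record = json.dumps(provenance, sort_keys=True).encode() + data
    return hmac.new(key, record, hashlib.sha256).hexdigest()

def verify(key: bytes, data: bytes, provenance: dict, tag: str) -> bool:
    """Recompute the tag and compare in constant time (integrity check)."""
    return hmac.compare_digest(seal(key, data, provenance), tag)

key = b"shared-secret"  # assumption: key agreed out of band
data = b"orthoimage bytes"
prov = {"wasGeneratedBy": "orthorectification", "agent": "softwareS"}

tag = seal(key, data, prov)
assert verify(key, data, prov, tag)            # untouched record verifies
assert not verify(key, b"forged", prov, tag)   # forged data is detected
```

An adversary who swaps in fake data, or attaches existing provenance to different data, cannot produce a valid tag without the key, which is the unforgeability property described above.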

Researchers have proposed several data provenance efforts. The Provenance-Aware Storage System (PASS) was the first scheme to address the collection and maintenance of provenance data at the operating system level [33]. A file provenance system [34] was proposed to collect provenance data by intercepting file system calls below the virtual file system, which requires changes to operating systems. For cloud data provenance, S2Logger [35] was developed as an end-to-end data tracking tool which provides both file-level and block-level provenance in kernel space. In addition to data provenance techniques and tools, the security of provenance data and user privacy has also been explored. Asghar et al. [32] proposed a secure data provenance solution in the cloud that adopts a twofold encryption method to improve privacy, but at a higher computation cost. In SPROVE [36], provenance data confidentiality and integrity are protected using encryption and digital signatures, but SPROVE does not support querying of provenance data. The kernel-level logging tool Progger [37] provides log tamper-evidence at the expense of user privacy. There are also efforts that use provenance data for managing cloud environments, such as discovery of usage patterns for cloud resources, resource reuse, and fault management [38].

7. Workflows

The design and management of scientific workflows have become increasingly popular for compute-intensive and data-intensive scientific applications. The vision and promise of scientific workflows include rapid, easy workflow design, reuse, scalable execution, and other advantages, e.g., to facilitate “reproducible science” through provenance (e.g., data lineage) support [39]. However, important research challenges remain. There is an urgent need for a common format and standard to define workflows and to enable sharing of analysis results and provenance information using a workflow environment. Cloud systems are structured in a fundamentally different way from other distributed systems, such as grids, and therefore present new challenges for the collection of provenance data.

7.1. Workflow Management Services

Freezing and packaging a runtime environment that includes all the software components and their dependencies used in a data analysis workflow is today considered a best practice and has been widely adopted in cloud computing environments, where images and snapshots are used and shared by researchers [40]. To distribute system-wide software, various lightweight container-based virtualization technologies and package managers have emerged, such as Docker and Singularity.

Docker [41] is a lightweight container-based virtualization technology that facilitates the automation of application development by archiving software systems and environments to improve the portability of applications across many common platforms, including Mac OS X, Linux, Microsoft Windows, and cloud instances. One of Docker’s main features is the ability to quickly find, download, deploy, and run container images created by other developers. Within the context of Docker, the place where images are stored is called a registry, and Docker Inc. offers a public registry called Docker Hub. The registry, together with the Docker client, can be thought of as the equivalent of Node’s NPM, Perl’s CPAN, or Ruby’s RubyGems.

Singularity [42] is a cross-platform open source container engine specifically supporting High Performance Computing (HPC) resources. Singularity can import Docker-format software images. Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that users do not have to ask a cluster administrator to install anything for them: they can put it in a Singularity container and run it.

The sharing and preservation of runtime environment packaging is becoming regular practice in the workflow domain and is supported today by all the leading platforms managing cloud infrastructure and computing services. These cloud providers include Digital Ocean [43], Amazon Elastic Compute Cloud [44], Google Cloud Platform [45] and Microsoft Azure [46]. The instances launched on these platforms can be saved as snapshots to be analyzed or can be recreated to restore the computing state at analysis time.

7.2. Workflow description standards

The Common Workflow Language (CWL) [47] has emerged as a workflow definition standard for heterogeneous workflow environments. CWL is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and HPC environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, High Energy Physics, and Machine Learning. CWL is developed by a multi-vendor working group consisting of organizations and individuals aiming to enable scientists to share data analysis workflows [48]. CWL has been widely adopted by a large number of organizations and has been implemented in a large set of open source tools and workflow management systems.

While a common standard for describing analysis workflows is an important step toward interoperability of workflow management systems, it is also important to share and publish the results of these workflow executions in a transparent, comprehensive, interoperable, and secure manner. This is essential to reduce duplication of workflows, improve the trust and integrity of data and analyses, improve the reproducibility of scientific workflows, and catalyze the discovery of new knowledge. Currently, there is no commonly agreed format for interoperable and secure provenance, workflow archiving, or sharing.

CWLProv [14] has recently been proposed as a format to represent any workflow-based computational analysis and to produce workflow output artifacts that satisfy various levels of provenance. CWLProv is based on open source, community-driven standards: interoperable workflow definitions in CWL, structured provenance using the W3C PROV model [6], and resource aggregation and sharing as a workflow-centric Research Object (RO) [16] generated alongside the final outputs of a given workflow enactment.

8. Blockchain Technologies

This section provides some background information about blockchain technologies. A blockchain is, in layman’s terms, a series of connected blocks that together form a chain. More technically, a blockchain is a shared, trusted, append-only ledger that contains the transactions made between users in the network. Blockchain combines private key cryptography, peer-to-peer networking with an open ledger, and incentivizing protocols. This ledger is distributed among participants in the peer-to-peer system, where peers in the network store a copy of the ledger. Because the ledger is distributed throughout the network, the peers have to reach consensus and agree on the order of the blocks. This is critical, since it is essential that every peer in the network has the same view of the blockchain.
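The append-only, chained structure can be sketched in a few lines of Python. This toy example (not any particular blockchain implementation; real systems add consensus, signatures, and Merkle trees on top) shows how each block stores the hash of its predecessor, so that altering earlier history invalidates the chain:

```python
# Toy append-only chain: each block references its predecessor by hash,
# making earlier history tamper-evident.
import hashlib
import json

def block_hash(block: dict) -> str:
    """Deterministic SHA-256 digest of a block's serialized content."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    """Append a block linking back to the current chain tip."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def chain_is_valid(chain: list) -> bool:
    """Every block's prev_hash must match the recomputed hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, ["alice pays bob 5"])
append_block(chain, ["bob pays carol 2"])
assert chain_is_valid(chain)

chain[0]["transactions"] = ["alice pays mallory 500"]  # tamper with history
assert not chain_is_valid(chain)  # the next block's prev_hash no longer matches
```

Rewriting an old block silently is impossible: every later block would also have to be rebuilt, which is exactly what the consensus mechanisms discussed below make prohibitively hard in an open network.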

8.1. Governance in DLTs, Blockchain, Hybrid Blockchain

Governance in DLTs, Blockchain and Hybrid Blockchains can be described as follows [49]:

  • Distributed Ledger Technology (DLT): "Governance in DLTs is mostly centralized in one or a few validator nodes that are identified and other nodes that might have read access with the permission of the validator nodes. Governance is closed and the network of nodes is permissioned and new nodes can only join with permission from the validator nodes" [49].

  • Hybrid Blockchains: "Governance in a semi-public or public permissioned Blockchain is often defined by the validator nodes that are identified, usually, public institutions like government agencies, educational institutions or corporates. However, read access to the ledger is open to everyone, which is different from DLTs where you need to be invited to have read access" [49].

  • Blockchain: "The governance in a blockchain such as Bitcoin is decentralized through the global distribution of all nodes each having all the data of the blockchain, the free software that allows anyone to participate in “Bitcoin” and the relative decentralization of mining (which has become more centralized over time) to reach a consensus on the truth of the blockchain. In Bitcoin and similar blockchain efforts everything is aimed at maximizing decentralization" [49].

8.2. Permissioned vs. Permissionless and Private versus Public

To meet the different requirements of applications and companies, different types of blockchains have been developed in recent years. Data stored in the blockchain can vary in importance and sensitivity, and thus it is important to control access to this information. For example, the Bitcoin blockchain is a public and permissionless blockchain that allows anyone to read and send transactions to the network. A permissionless blockchain means that everyone can contribute to the consensus process that validates new blocks of transactions to be added to the blockchain.

Open networks must use proof-based algorithms to establish trust because members of the network are inherently untrusted. By providing proof that is acceptable to a majority of the rest of the network, new blocks can be appended to the ledger. Since producing the proof can be computationally expensive, cryptocurrencies provide the incentive of receiving coins in the currency in return for the work.

If the primary motivation of such a costly proof scheme is to create trust between untrusted parties, one obvious alternative is to work only with trustworthy nodes. Permissioned networks are composed of nodes identified by cryptographic keys. Other members of the network grant these nodes permission to join the network. Consensus is reached in this situation by simply making sure that the source and purpose of a transaction are valid between parties on the network. There is no need for an incentive to check transactions; simply checking the membership of the parties in the network is sufficient. This improves performance and alleviates any concern about the work, stake, or cost required to participate. An example of a permissioned blockchain is the open source Hyperledger Fabric distributed ledger software.

8.3. Hashing

To maintain the reliability and integrity of the blockchain and to avoid recording fraudulent data or double-spending transactions, the technology relies on one of its key components: hashing. A hash function is a mathematical algorithm (such as SHA-256, used by Bitcoin) that takes an input of any length and turns it into a fixed-length cryptographic output. Examples of such inputs include a short piece of information such as a message or a smart contract, a block of transactions, or the address of content in a content-addressable system such as IPFS.
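For example, Python's standard hashlib module computes SHA-256 digests; regardless of the input's length, the output is always 256 bits (64 hexadecimal digits), and even a one-character change to the input yields a completely different digest:

```python
import hashlib

# SHA-256 maps input of any length to a fixed 256-bit (64 hex digit) digest.
digest = hashlib.sha256(b"hello").hexdigest()
print(digest)       # 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
print(len(digest))  # 64

# A one-character change in the input produces a completely different digest.
assert hashlib.sha256(b"Hello").hexdigest() != digest

# The digest length stays fixed no matter how large the input is.
assert len(hashlib.sha256(b"x" * 1_000_000).hexdigest()) == 64
```

These two properties, fixed-length output and extreme sensitivity to the input, are what allow a hash to serve as a compact, tamper-evident fingerprint for a block of transactions or a piece of content.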

8.4. Consensus

Consensus protocols are one of the most important and revolutionary aspects of blockchain technology. These protocols create an irrefutable system of agreement between various devices across a distributed network, whilst preventing exploitation of the system.

Blockchain consensus protocols are what keep all the nodes on a network synchronized with each other, while providing an answer to the question: How do we all make sure that we agree on what the truth is?

‘Consensus’ means that the nodes on the network agree on the same state of the blockchain. This, in a sense, makes it a self-auditing ecosystem. This is an absolutely crucial aspect of the technology, carrying out two key functions. Firstly, consensus protocols allow a blockchain to be updated while ensuring that every block in the chain is true, and in many cases they keep participants incentivized. Secondly, they prevent any single entity from controlling or derailing the whole blockchain system. The aim of consensus rules is to guarantee that a single chain is used and followed.
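A deliberately simplified sketch of one well-known consensus rule, the Bitcoin-style "longest valid chain" rule (used here purely for illustration, not as a description of any specific deployment), shows how nodes that independently apply the same deterministic rule converge on a single chain:

```python
# Highly simplified illustration of a "longest valid chain" consensus rule:
# when nodes see competing histories, each independently adopts the longest
# chain that passes validation, so all honest nodes converge on one state.

def choose_chain(candidates, is_valid):
    """Apply the same deterministic rule every node uses: longest valid chain."""
    valid = [chain for chain in candidates if is_valid(chain)]
    return max(valid, key=len)

chain_a = ["genesis", "b1", "b2"]
chain_b = ["genesis", "b1", "b2x", "b3x"]  # a longer competing fork

def is_valid(chain):
    # Stand-in for real validation (signatures, proof-of-work, etc.).
    return chain[0] == "genesis"

# Every node applying the same rule to the same inputs reaches the same result.
assert choose_chain([chain_a, chain_b], is_valid) == chain_b
assert choose_chain([chain_b, chain_a], is_valid) == chain_b  # order-independent
```

The point of the sketch is the determinism: because every node evaluates the same rule over the same candidate chains, no coordination beyond gossiping the chains themselves is needed for the network to agree on a single history.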

8.5. IPFS

The InterPlanetary File System (IPFS) is likely the foremost P2P file system at the moment and represents the state of the art in the quest for a distributed web. IPFS takes ideas from previous P2P systems such as distributed hash tables (DHTs), BitTorrent, Git, and Self-Certified Filesystems (SFS), and tries to simplify those ideas and take them even further. Nodes in an IPFS network store objects (files and other data structures) in local storage and connect to each other to transfer these objects [50].

Nodes in the IPFS network are identified by a "NodeId", which is the cryptographic hash of a public key. When a node in the network requests a file or other network objects, the network’s routing system finds peers who can serve the requested objects (known as "blocks"), gets the corresponding network addresses, and connects the requester to the discovered peers. When two peers first connect, they exchange public keys and check to make sure the connection is secure. If the check fails, the connection ends. Once the peers have connected, they use an IPFS protocol called BitSwap, which is based on BitTorrent, to "barter" and exchange blocks. This bartering process is meant to prevent freeloader nodes from exploiting the file system. Once the exchange is over, the requester node now has full copies of the blocks it received and can then share them with the next requester [50].

IPFS uses a Merkle DAG (directed acyclic graph), where links between objects are cryptographic hashes of the content the link refers to. This means that all content in IPFS, including links, is identified by its hash checksum. This, in turn, helps prevent data tampering and eliminates data duplication. It also means that, unlike HTTP, which is location-addressed, IPFS is content-addressed.
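Content addressing can be sketched as follows. This hypothetical in-memory store (not the actual IPFS API) illustrates how an object's identifier is derived from its content, how links are themselves hashes, and why identical content deduplicates automatically:

```python
# Sketch of content addressing as used in a Merkle DAG: an object's identifier
# is the hash of its content, and links are themselves hashes, so any change
# to a linked object changes every identifier above it. Not the real IPFS API.
import hashlib
import json

store = {}

def put(obj) -> str:
    """Store an object under the SHA-256 hash of its serialized content."""
    data = json.dumps(obj, sort_keys=True).encode()
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = obj
    return cid

leaf = put({"data": "file contents"})
root = put({"links": [leaf]})  # the link is the child's content hash

# Retrieval is by content address; tampering would change the address itself.
assert store[root]["links"][0] == leaf
# Identical content always hashes to the same identifier (deduplication).
assert put({"data": "file contents"}) == leaf
```

Because the root's identifier covers the hashes of everything it links to, verifying one root hash transitively verifies the whole DAG, which is what makes the structure tamper-evident.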

Objects in IPFS are immutable and permanent, and Git-like versioning is built into the system. Old versions of objects, like every other object in IPFS, can simply be retrieved by their hash checksum.

Although its underlying store is immutable and content-addressed, IPFS does support mutable paths through a decentralized naming system called InterPlanetary Name Space (IPNS). IPNS takes advantage of the mutable state routing system in IPFS to store object hashes as metadata values that point to objects, can be changed, and, if the end user so chooses, can also point to previous versions of the same object. One drawback of the IPNS system is that it does not result in human friendly paths, so another layer needs to be added on top of IPNS if human friendly paths are desired.

8.6. BitTorrent File System (BTFS)

BitTorrent File System (BTFS) [51] is both a protocol and a network implementation that provides a content-addressable, peer-to-peer mechanism for storing and sharing digital content in a decentralized file system using the BitTorrent protocol. BTFS provides a foundation platform for Decentralized Applications (DApps).

BitTorrent, the largest P2P network in the world, still relies on centralized torrent file distribution. These torrent repositories are prone to security breaches, outages, and censorship. There have been numerous instances of attacks on torrent-hosting web servers, reducing service reliability. With a decentralized repository of torrent files leveraging the version-control properties of BTFS, users can more reliably access torrent files.

The BitTorrent File System (BTFS), created by the makers of BitTorrent, is an upcoming distributed file system implementation that began as a fork of the IPFS implementation. BTFS promises to take what IPFS has accomplished and present an improved version that is more ready for widespread adoption. The makers of BTFS plan to integrate the existing BitTorrent P2P file exchange system into BTFS, which could help BTFS gain the wide use that IPFS has largely failed to attain [51].

By leveraging the massive existing infrastructure of BitTorrent user nodes (close to 100 million), BTFS is aimed at becoming the largest distributed storage network as well as the world’s largest distributed media sharing network [51].

One of the primary advantages BTFS claims to have over IPFS is its integration of native token economics. These tokens are intended as an incentive for network nodes to contribute by storing data rather than simply leeching off existing nodes. BTFS is also set to release with a set of developer tools that will make working with the file system more user friendly than other alternatives. BTFS claims a public version of the system will be available by 2020 [51].

8.7. Hyperledger Fabric

Hyperledger Fabric [52] is an open source enterprise-grade permissioned DLT platform, designed for use in enterprise contexts, that delivers some key differentiating capabilities over other popular distributed ledger or blockchain platforms. Hyperledger Fabric is maintained by the Linux Foundation and evangelized by IBM. Fabric is currently one of the most widely adopted blockchain technologies by the biggest enterprises, much more so than the “big name” blockchains. Companies such as Oracle, Walmart, Airbus, Accenture, Daimler, Thales, The National Association of Realtors, Deutsche Borse Group, and Sony Global Education are all members of the community that develops and maintains Fabric.

One key point of differentiation is that Hyperledger was established under the Linux Foundation, which itself has a long and very successful history of nurturing open source projects under open governance that grow strong sustaining communities and thriving ecosystems. Hyperledger is governed by a diverse technical steering committee, and the Hyperledger Fabric project by a diverse set of maintainers from multiple organizations. Fabric has a development community that has grown to over 35 organizations with nearly 200 developers since its earliest commits.

Fabric has a highly modular and configurable architecture, enabling innovation, versatility and optimization for a broad range of industry use cases including banking, finance, insurance, healthcare, human resources, supply chain and digital product delivery.

"Fabric is the first distributed ledger platform to support smart contracts authored in general-purpose programming languages such as Java, Go and Node.js, rather than constrained domain-specific languages (DSL), and uses Docker container technology for deployment of smart contracts (what Fabric calls “chaincode”)" [52]. This means that most enterprises already have the skill set needed to develop smart contracts, with no additional training in a new language or DSL required.

The Fabric platform is also permissioned, meaning that, unlike with a public permissionless network, the participants are known to each other, rather than anonymous and therefore fully untrusted. This means that while the participants may not fully trust one another (they may, for example, be competitors in the same industry), a network can be operated under a governance model that is built off the trust that does exist between participants, such as a legal agreement or framework for handling disputes.

One of the most important of the platform’s differentiators is its support for pluggable consensus protocols that enable the platform to be more effectively customized to fit particular use cases and trust models. For instance, when deployed within a single enterprise, or operated by a trusted authority, fully byzantine fault tolerant consensus might be considered unnecessary and an excessive drag on performance and throughput.

Fabric can leverage consensus protocols that do not require a native cryptocurrency to incent costly mining or to fuel smart contract execution. Avoidance of a cryptocurrency reduces some significant risk/attack vectors, and absence of cryptographic mining operations means that the platform can be deployed with roughly the same operational cost as any other distributed system.

Hyperledger Fabric delivers a uniquely elastic and extensible architecture, distinguishing it from alternative blockchain solutions. The combination of these differentiating design features makes Fabric one of the best distributed ledger platforms available today both in terms of high degrees of confidentiality, resiliency, flexibility and scalability.

8.8. Smart Contracts

The concept of a Smart Contract was introduced in 1994 by Nick Szabo [53], a legal scholar, and cryptographer. Szabo came up with the idea that a decentralized ledger could be used for smart contracts, otherwise called self-executing contracts, blockchain contracts, or digital contracts. In this format, contracts could be converted to computer code, stored and replicated on the system and supervised by the network of computers that run the blockchain. Contracts would be activated automatically when certain conditions are met.

A smart contract, or simply a contract, in the context of blockchain is a small piece of code that is executed in response to a transaction. Business logic executed on the network is done through contracts. Contracts can have many uses, although some ledgers may limit the type of code that can be executed for architectural or security reasons. Smart contracts are automatically executable lines of code, stored on a blockchain, that contain predetermined rules. When these rules are met, the code executes on its own and produces its output. In the simplest form, smart contracts are programs that run according to the format set up by their creator. Smart contracts are most beneficial in business collaborations, where they encode terms agreed upon by the consent of both parties. This reduces the risk of fraud, as there is no third party involved, and also reduces costs.

In the context of blockchain, a smart contract is defined as a machine-processable business agreement embedded into the transaction database and executed with transactions. The function of the contract is to define the rules governing the flow of value and the state of a business transaction. The contract is "smart" because it executes the terms of the contract using a computerized protocol. The core idea behind smart contracts is to codify various contractual clauses (such as acceptance criteria, delineation of property rights, and so forth) to enforce compliance with the terms of the contract and ensure a successful transaction. Smart contracts are designed to guarantee one party that the other will fulfill their agreement. One of the benefits of smart contracts is to reduce the costs of verification and enforcement. Smart contracts are required to be observable (meaning that participants can see or prove each other’s actions pertaining to the contract), verifiable (meaning that participants can prove to other nodes that a contract has been performed or breached), and private (meaning that knowledge of the contents/performance of the contract should involve only the participants required to execute it). Bitcoin has basic support for smart contracts; however, its script lacks essential capabilities such as Turing-completeness and statefulness.
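The self-executing behavior described above can be sketched in a few lines of Go. This is a platform-neutral toy, not code for any particular blockchain: the `Contract` type, its fields, and the escrow-style clause are all illustrative assumptions chosen only to show how a codified clause releases value automatically once its condition is met.

```go
package main

import "fmt"

// Contract is a toy self-executing agreement: a payment is held back
// and released only when the agreed delivery condition is met.
// All names here are illustrative; real platforms (Ethereum, Fabric)
// provide their own contract APIs and ledger-backed state.
type Contract struct {
	Buyer, Seller string
	Amount        int
	Delivered     bool
	Settled       bool
}

// Execute enforces the codified clause: payment flows to the seller
// only once delivery is confirmed; otherwise nothing happens.
func (c *Contract) Execute() string {
	if c.Settled {
		return "contract already settled"
	}
	if !c.Delivered {
		return "condition not met: awaiting delivery"
	}
	c.Settled = true
	return fmt.Sprintf("released %d units from %s to %s", c.Amount, c.Buyer, c.Seller)
}

func main() {
	c := &Contract{Buyer: "alice", Seller: "bob", Amount: 100}
	fmt.Println(c.Execute()) // condition not met: awaiting delivery
	c.Delivered = true       // condition satisfied, e.g. confirmed externally
	fmt.Println(c.Execute()) // released 100 units from alice to bob
}
```

The point of the sketch is that no party calls a "pay" function directly: once the rule's condition holds, executing the contract produces the agreed outcome deterministically, which is what lets every node on a blockchain reach the same result.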

Launched in 2015, the Ethereum blockchain is the world’s leading public programmable blockchain supporting smart contracts. It replaces Bitcoin‘s more restrictive script language with one that enables developers to build their own decentralized applications. On this platform, smart contracts are implemented in the Solidity language, which is Turing-complete. Solidity is currently the most prominent public smart contract framework, allowing anyone to write smart contracts and decentralized applications by creating their own arbitrary rules for ownership, transaction formats, and state transition functions.

In the open source Hyperledger Fabric, smart contracts, known as chaincode, can be written in Go, Node.js, or Java. Chaincode is installed on peers and requires access to the asset states to perform reads and writes. Chaincode runs in a secured Docker container isolated from the endorsing peer process, and is then instantiated on specific channels for specific peers. Chaincode initializes and manages the ledger state through transactions submitted by applications. A chaincode typically handles business logic agreed to by members of the network, so it is similar to a “smart contract”. A chaincode can be invoked to update or query the ledger in a proposal transaction. Given the appropriate permission, a chaincode may invoke another chaincode, either in the same channel or in a different channel, to access its state. Note that if the called chaincode is on a different channel from the calling chaincode, only read queries are allowed: the cross-channel call is a query only and does not participate in state validation checks in the subsequent commit phase.
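The read/write pattern described above can be illustrated with a minimal Go sketch. The `MockStub` below is a stand-in written for this example, not Fabric's actual shim (the real API lives in the Hyperledger Fabric chaincode packages and executes inside a Docker container on each endorsing peer); it only mimics the idea that a transaction function touches ledger state exclusively through `GetState`/`PutState` calls, which is what allows the peer to capture a read/write set for endorsement.

```go
package main

import (
	"errors"
	"fmt"
)

// MockStub mimics, in greatly simplified form, the key-value state
// interface that chaincode uses to reach the ledger. It is an
// illustrative assumption for this sketch, not the Fabric API.
type MockStub struct {
	state map[string][]byte
}

func NewMockStub() *MockStub {
	return &MockStub{state: map[string][]byte{}}
}

func (s *MockStub) GetState(key string) ([]byte, error) {
	v, ok := s.state[key]
	if !ok {
		return nil, errors.New("key not found: " + key)
	}
	return v, nil
}

func (s *MockStub) PutState(key string, value []byte) error {
	s.state[key] = value
	return nil
}

// CreateAsset is a chaincode-style transaction function: it never
// touches the ledger directly, only the stub, so every read and write
// is mediated and recordable by the peer.
func CreateAsset(stub *MockStub, id, owner string) error {
	if _, err := stub.GetState(id); err == nil {
		return fmt.Errorf("asset %s already exists", id)
	}
	return stub.PutState(id, []byte(owner))
}

func main() {
	stub := NewMockStub()
	if err := CreateAsset(stub, "asset1", "org1"); err != nil {
		panic(err)
	}
	owner, _ := stub.GetState("asset1")
	fmt.Println(string(owner)) // org1
}
```

In real Fabric chaincode the same shape appears: business-rule checks first (here, the duplicate-asset guard), then state mutation through the stub, with the resulting read/write set validated at commit time.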

Smart contracts are one of the most successful applications of blockchain technology. Using smart contracts in place of traditional ones can significantly reduce transaction costs and reliance on trusted intermediaries, and improve automation and security. They are also tamper-proof, as no one can change what has been programmed.

9. Identity Management

Data in Clouds is geographically dispersed and is frequently accessed by a number of actors. Actors are active elements inside or outside the network, including cloud services, peers, client applications, administrators, and so forth. In such a shared and distributed environment, data moves from one point to another through communication networks. The number of data transactions increases as the number of users and volume of data increases. The growing interactions with this dispersed data increase the chances of lost data, data alteration and/or unauthorized access.

To determine the exact permissions over a resource and access to information on the ledger, digital identity plays a central role. Identity is fundamental to how parties interact, and is thus central to securing access and ensuring the integrity of information. Digital identity for people, organizations and devices, together with secure, encrypted, privacy-preserving storage and computation of data, is critical to the establishment of the Web of Trust. This section describes the evolution of online identity models from centralized toward Self-Sovereign Identity (SSI), explains Public Key Infrastructures (PKIs), and demonstrates how DLTs and emerging SSI standards can be used as the foundation of an identity layer for the Internet and a Web of Trust based on Decentralized Public Key Infrastructure (DPKI).

9.1. Evolution of Online Identity Model

The Internet was built without an identity layer. That is, there was no standard way of identifying people and organizations. The addressing system was based solely on identifying physical machine endpoints, not people or organizations. In his article “The Path to Self-Sovereign Identity” [54], Christopher Allen provides a clear analysis of the online identity landscape and describes an evolutionary path composed of four broad stages since the advent of the Internet: Centralized identity, Federated identity, User-Centric identity, and Self-Sovereign identity (see Figure 4).

Figure 4. The evolution of online identity

The evolution of internet identity is the result of trying to satisfy three basic requirements [55]:

  1. Security - the identity information must be protected from unintentional disclosure;

  2. Control - the identity owner must be in control of who can see and access their data and for what purposes;

  3. Portability - the user must be able to use their identity data wherever they want and not be tied into a single provider.

The following describes the different evolutionary stages of the identity model in more detail.

9.1.1. Centralized Identity

Due to the lack of an identity layer on the Internet, several web sites started to provide their own identity management via username/password-protected accounts. These systems were centralized, owned and controlled by a single entity, such as an eCommerce website or a social network (Google, Facebook). The user does not own their identity record, as it can be taken away at any time due, for example, to policy violations, web site shutdown or censorship. This effectively erases a person’s online identity, which may have been used for years, may be of significant value to them, and is impossible to replace.

As the Internet expands rapidly to new types of devices (Internet of Things), online web sites and services, the maintenance of identity on every endpoint becomes unsustainable. Users are asked to provide the same identification information over and over again, and must remember a plethora of user accounts and passwords that are subject to hacking and data breaches if private information is not adequately secured.

9.1.2. Federated Identity

To address some of the problems of centralization, federated identity management provides a degree of portability to a centralized identity. This is done by enabling a user to log into one service using the credentials of another (e.g. a Facebook or Google account in the consumer internet). At a more complex level, it can allow different services to share details about the user. Federation is common within large businesses, where single sign-on mechanisms allow a user to access multiple separate internal services such as Human Resources, accounting, etc., with a single username and password. Although federation provides a semblance of portability, concentrating the management of identities in a small number of corporations increases the risk of hacking by an order of magnitude. Further, the implications for a user of having their centrally federated account deleted or compromised are much more profound if that account is their key to many other third-party services [55]. In addition, the cost and growing economic inefficiency of collecting, storing, and protecting personal data is reaching a tipping point today.

9.1.3. User-Centric Identity

The Identity Commons (2001-Present) was established to consolidate new work on digital identity with a focus on decentralization. Its most important contribution may have been the creation, in association with the Identity Gang, of the Internet Identity Workshop (IIW) (2005-Present) working group. The IIW community focused on a new term that countered the server-centric model of centralized authorities: user-centric identity. The term suggests that users are placed in the middle of the identity process. The work of the IIW has supported many new methods for creating digital identity, including OpenID (2005), OpenID 2.0 (2006), OpenID Connect (2014), OAuth (2010), and FIDO (2013). As implemented, user-centric methodologies tend to focus on two elements: user consent and interoperability. By adopting these elements, a user can decide to share an identity from one service to another and thus de-balkanize their digital self [54].

User-centric identity is most frequently manifested in the form of independent personal data stores at one end of the spectrum, and large social networks at the other end. However, the entire spectrum still relies on the user selecting an individual identity provider and agreeing to their often one-sided adhesion contracts [55]. The user-centric identity communities intended to give users complete control of their digital identities. Unfortunately, powerful institutions such as Facebook or Google co-opted their efforts and kept them from fully realizing their goals. Much as with the Liberty Alliance, final ownership of user-centric identities today remains with the entities that register them [54].

OpenID offers an example. A user can theoretically register their own OpenID, which they can then use autonomously. However, this takes some technical know-how, so the casual Internet user is more likely to use an OpenID from one public web site as a login for another. Users must select a long-lived and trustworthy site to gain many of the advantages of a self-sovereign identity, yet the registering entity could take that identity away at any time.

Facebook Connect (2008) appeared a few years after OpenID, leveraging lessons learned, and was several times more successful, largely due to a better user interface. Unfortunately, Facebook Connect veers even further from the original user-centric ideal of user control, not only because Facebook is the sole provider but also because Facebook has a history of arbitrarily closing accounts, as was seen in its recent real-name controversy [56]. As a result, people using their “user-centric” Facebook Connect identity to connect to other sites may be even more vulnerable than OpenID users to losing that identity in multiple places at one time [54].

9.1.4. Self-Sovereign Identity

Self-sovereign identity (SSI) is the last step in the digital identity evolution. SSI is independent from any individual silo, and provides all three required elements: individual control, security, and full portability. SSI addresses the problems of the centralized external controls from the three previous phases. The individual (or organization) to whom the identity pertains completely owns, controls and manages their identity. There is no external party who can claim to “provide” the identity for them or to take away their identity because it is intrinsically theirs. The individual’s digital existence is independent of any single organization [55].

In a 2016 article [57], Phil Windley describes self-sovereign identity as an “Internet for identity” which, like the Internet itself, has three virtues: no one owns it, everyone can use it, and anyone can improve it. Similar to the internet, the move to self-sovereign identity is a move from a silo mentality to a layer mentality. Instead of having every organization maintaining their own siloed identity information store with possibly a suite of APIs to connect to other such silos, each organization can have one single connection to the Internet’s identity layer, and immediately benefit from all the organizations that are already present.

Figure 5 illustrates the migration from centralized to decentralized peer-to-peer self-sovereign identity. In the current identity model, every identity is given by an organization. For example, to do business with Amazon, you need to create an account and you are subject to the terms and conditions of the service; these terms can be changed retroactively, and the account can be taken down at any time if you violate them. To change the identity model, people need to have full control of their identities, which basically turns the centralized model inside-out, resulting in a self-sovereign identity model. Each organization and person establishes their identity independently and is seen as a peer in a peer-to-peer network that is decentralized and not controlled by anyone.